GENOME PROBING
USING MICROARRAYS




Chapter 1 



INTRODUCTION 



After genome sequencing, microarray technology has emerged as a 
widely used platform for genomic studies in the life sciences. DNA mi- 
croarrays have orderly arrangements of nucleic acid spots at high den- 
sity. Many research studies have demonstrated the general usefulness of 
genome probing using microarrays. While genomics aims to give biol- 
ogists an inventory of all genes used to assemble life forms, microarray 
technology provides high-throughput measurements in molecular biol- 
ogy, yields information for the reconstruction of complex gene control 
networks, and offers a panoramic view of the consequences of controlling 
gene transcriptions. 

Microarray technology provides a systematic way to survey
deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) variation. DNA
microarray chips have revolutionized genetic research in the same way that
silicon chips revolutionized the computer industry and applications 1 .
Microarray technologies allow the transcription levels of thousands of 
genes to be measured simultaneously. Understanding biological systems 
with thousands of genes will require organizing similar parts by their 
properties. Methods to group genes with similar expression patterns 
have proved useful in identifying genes that contribute to common func- 
tions or genes that are likely to be co-regulated 2 . The hypothesis that 
many human diseases may be accompanied by specific changes in gene 
expression has generated much interest in gene expression monitoring 
at the genome level using arrays. Microarray gene expression studies 
open up fresh avenues of cancer class discovery and class prediction 3 . 
Microarrays can be used in the determination of prognosis in histologi- 
cally similar tumors with variable tendency to recur or spread and in the 
classification of tumors of uncertain histotype or tissue origin.
Large-scale gene expression analysis can also increase the depth of diagnostic
and drug-effect profiling. By comparing gene expression in normal and 
abnormal cells, microarrays can accelerate the discovery of key biologi- 
cal processes for therapeutic targeting. Gene-expression profiling using 
microarrays permits a simultaneous analysis of multiple markers 4 . Mi- 
croarrays can be used to screen for polymorphisms within the population 
that may protect against or predispose to disease 5 . High-throughput 
microarrays can be used as a screen for early detection of disseminated 
tumor cells in peripheral blood 6 . DNA microarrays have been used to 
identify two classes of familial BRCAx breast cancers that differ in their 
expression of a large number of genes 7 . Arrays have also been used 
in comparing expression profiles of chronic lymphocytic leukemia speci- 
mens with and without immunoglobulin gene mutations 8 , among many 
other applications. 

While simultaneous measurement of thousands of gene expression lev- 
els provides a potential source of profound knowledge, success of the 
microarray technology depends on the precision of the measurements 
and on the integration of computational tools for data mining,
visualization, and statistical modeling. With this technology, the
expression of thousands of genes is measured in parallel, and the data
obtained from image analysis are inherently noisy. In Chapter 4 we show that
there are many sources of variation in microarray experiments. To ob- 
tain reliable gene expression data, experimental procedures need to be 
rigorously controlled to minimize noise and extraneous variation. Inter- 
nal controls are needed on the arrays to allow for possible errors such as 
imperfect hybridization and repetitive sequences. 

With the abundance of data produced from microarray studies, Lander
(1996) pointed out that the greatest challenge is analytical. The
impact of microarray technology on biology will depend heavily on data 
mining and statistical analysis. A sophisticated data-mining and analyt- 
ical tool is needed to correlate all of the data obtained from the arrays, 
to group them in a meaningful way, and to perform statistical analysis 
in order to investigate hypotheses of interest. Experimental design and 
statistical methods provide powerful analytical tools to biologists for the 
study of living systems. Through statistical analysis and the graphical 
display of clustering and classification results, microarray experiments 
allow biologists to assimilate and explore the data in a natural and in- 
tuitive manner. 

The challenge to statisticians is the nature of the microarray data. 
Instead of having large numbers of sample observations for a few
variables, microarray data usually involve thousands of gene variables but
few specimen samples. Microarray experiments raise numerous statisti- 
cal and computational questions in diverse areas such as image process- 
ing, cluster analysis, machine learning, discriminant analysis, principal 
component analysis, multidimensional scaling, analysis of variance mod- 
els, random effects models, multiplicative models, multiple testing, mod- 
els with measurement errors, models to handle missing values, mixture 
models, Bayesian methods, and sample size and power determination. 

The analysis of microarray data is a research field that is evolving. 
The contribution of this book is to provide general readers with an inte- 
grated presentation of various topics on analyzing microarray data. With 
the modest aim of providing an introduction to microarray technology 
and exploring different analytical methods that have been considered, 
the book is organized into four parts. We begin in Part I by providing 
the needed background knowledge about array technology. In order to 
familiarize readers with the necessary genetic terminology, basic genetic 
concepts are briefly reviewed in Chapter 2. In Chapter 3, we introduce 
different types of microarray platforms and review the basic procedures 
involved in microarray experiments. Array data measured from differ- 
ent platforms are discussed in Chapter 3. Important problems involving 
array data variation, background correction, and normalization are dis- 
cussed in Chapters 4 to 6. The main focus of this book begins with Part 
II, where we provide a systematic presentation of statistical issues and 
methods for analyzing microarray gene expression data. We give a basic 
description of different types of experimental designs and discuss the ad- 
vantages and disadvantages of some common designs used in microarray 
studies. The methods of analysis of variance for fixed effects and random 
effects models are presented in Chapter 10. Because of the large number 
of genes involved, the issue of multiple testing discussed in Chapter 11 
should not be ignored in analyzing array data. The useful technique of 
permutation tests is considered in Chapter 12. Bayesian methods for 
analyzing array data are explored in Chapter 13. We present power and 
sample size considerations for microarray studies in Chapter 14. Related 
research topics not described in this book that require further investi- 
gation are also mentioned. Although clustering methods were used in 
analyzing array data in earlier literature, we present clustering methods 
after statistical methods, in part because clustering using model-based 
normalization might provide more accurate results. Various unsuper- 
vised learning methods are discussed in Part III. In Part IV we consider 
supervised learning methods. 

Chapter 2



DNA, RNA, PROTEIN, AND 
GENE EXPRESSION 



This chapter provides a basic understanding of the background of 
microarray experiments for readers who are not familiar with molecu- 
lar biology. Further reading is encouraged, as an understanding of the 
underlying biology is required to make correct decisions in statistical 
analysis. For a more detailed description of genes and genetic analysis, 
see Lewin 1 (2000), Cooper 2 (2000), Griffiths et al. 3 (1999), Lehninger et 
al. 4 (2000), Griffiths et al. 5 (2000), Alberts et al. 6 (2002), and Dale and 
Schantz 7 (2002). Readers with a good knowledge of biology are advised 
to skip directly to Chapter 3 on Microarray Technologies or Chapter 4 
on Inherent Variability in Microarray Data. 

2.1. The Molecules of Life 

A cell is the minimal unit of life. There are a multitude of specific 
chemical transformations that not only provide the energy needed by 
a cell, but also coordinate all of the events and activities within that 
cell. The life process involves a wide array of molecules ranging from 
water to small organic compounds (e.g., fatty acids and sugars), and 
macromolecules (DNA, proteins, and polysaccharides) that define the 
structure of the cells. Macromolecules control and govern most of the 
activities of life. Deoxyribonucleic acid (DNA) molecules store informa- 
tion about the structure of macromolecules, allowing them to be made 
precisely according to cells’ specifications and needs. 

DNA is a very stable molecule that forms the “blueprint” of an or- 
ganism. The DNA structure encodes information as a sequence of chem- 
ically linked molecules that can be read by the cellular machinery and 
guides the construction of the linear arrangements of protein building 
blocks, which eventually fold to form functional proteins. Molecular
biology deals with how information is stored and converted to all the 
components and interactions that make up a living organism. Each cell 
contains a complete copy of its genetic material in the form of DNA 
molecules. The DNA can be copied and passed on to the cell’s progeny 
through a mechanism called replication. The genetic information can 
also be copied as a transportable “working copy” composed of ribonu- 
cleic acid (RNA) molecules, which are closely related to DNA. This 
process is called transcription. The RNA is transferred to a machinery 
that synthesizes protein molecules based on the information carried by 
the RNA. This process is called translation. The process sequence is 
illustrated by the following chart. 



DNA --(transcription)--> RNA --(translation)--> protein

What has just been described is the central dogma of molecular biol- 
ogy that formulates how information is stored and converted to all the 
components and interactions that build up a living organism. 

Proteins are the most functionally versatile of the life molecules. Being 
the “work horses” or “machines” of a cell, proteins catalyze an extraor- 
dinarily wide variety of chemical reactions and also serve as the building 
blocks of cellular structures. They are the building blocks of muscles, 
skin, and hair, as well as the enzymes that catalyze and control all 
chemical reactions in an organism, ranging from food digestion to nerve 
impulses and the components that are responsible for DNA replication, 
transcription, and translation. 

In the following sections we will discuss the building blocks and the 
higher-order structure of the macromolecules of life. 

2.2. Genes 

Genes are the units of the DNA sequence that control the identifiable 
hereditary traits of an organism. A gene can be defined as a segment of 
DNA that specifies a functional RNA. The total set of genes carried by an 
individual or a cell is called its genome. The genome defines the genetic 
construction of an organism or cell, or the genotype. The phenotype, 
on the other hand, is the total set of characteristics displayed by an 
organism under a particular set of environmental factors. The outward 
appearance of an organism (phenotype) may or may not directly reflect 
the genes that are present (genotype). Today the complete genome 
sequences of several species are known, including several bacteria, yeasts, 
and humans. With microarray technology we can study the expression of
all the genes in an organism simultaneously. Such genome-wide studies 
will help to uncover and decipher cellular processes from a completely 
new perspective. 

2.3. DNA 

Except for some viruses, the genetic material of all known organisms 
consists of one or more long molecules of deoxyribonucleic acid (DNA). 
The chemical components of the DNA molecule dictate the inherent 
properties of a species. DNA is made up of chains of chemical build- 
ing blocks called nucleotides. Each nucleotide consists of a phosphate 
group, a deoxyribose sugar molecule, and one of four different nitroge- 
nous bases usually referred to by their initial letters: guanine (G), cyto- 
sine (C), adenine (A), or thymine (T). Genetic information is encoded 
in DNA by the sequence of these nucleotides. The information stored 
in the sequence of nucleotides in terms of the four nitrogenous bases is 
analogous to a long word in a four-letter alphabet. 

The carbons in the deoxyribose sugar group of a nucleotide are as- 
signed numbers followed by a prime symbol (1', 2', etc.). In DNA, the 
nucleotides are connected to each other via a link of the 5' hydroxyl phos- 
phate group of one pentose ring of the deoxyribose sugar to the 3' OH 
group of the next pentose ring. The chemical connections between the 
repeating sugar and phosphate groups are called phosphodiester bonds. 
With one 5' end and the other 3' end, each chain is said to have polarity. 
It is conventional to write nucleic acid sequences in the 5' → 3'
direction. DNA forms a double helix of two intertwined chains (strands) of
nucleotides. The two polynucleotide chains run in opposite directions;
that is, one strand runs in the 5' → 3' direction, while the other strand
runs in the 3' → 5' direction.

It was proposed in the now classic manuscript by Watson and Crick 8 in 
1953 that the two nucleotide chains are held together by hydrogen bonds 
that form between the nitrogenous bases. The polarity of the double 
helix requires specific hydrogen bonding between the bases so that they 
fit together. Guanine preferentially hydrogen-bonds with cytosine, and 
adenine can bond preferentially with thymine. That is, G pairs only 
with C, and A pairs only with T. These matching base pairs are referred 
to as complementary. For example, a short segment with ten nucleotides 
might be of the form 

5' - ATGCCCTGAC - 3'

3' - TACGGGACTG - 5'
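
The pairing rule can be illustrated with a few lines of Python. This is only a minimal sketch: the names PAIRING, complement_strand, and reverse_complement are hypothetical helpers introduced here, and the input is assumed to contain only the letters A, C, G, and T.

```python
# Watson-Crick pairing rule: G pairs with C, and A pairs with T.
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}

def complement_strand(strand_5_to_3: str) -> str:
    """Return the partner strand base by base, i.e. read in the 3' -> 5' direction."""
    return "".join(PAIRING[base] for base in strand_5_to_3)

def reverse_complement(strand_5_to_3: str) -> str:
    """Return the partner strand rewritten in the conventional 5' -> 3' direction."""
    return complement_strand(strand_5_to_3)[::-1]

top = "ATGCCCTGAC"
print(complement_strand(top))   # TACGGGACTG, the lower strand of the example
print(reverse_complement(top))  # GTCAGGGCAT, the same strand written 5' -> 3'
```

Writing the partner strand in both orientations makes the polarity convention explicit: the two strings describe the same molecule, read from opposite ends.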

The specific base pairing of DNA is the mechanism by which encoded
information can be transferred from generation to generation with very 
little alteration. When the DNA is copied by the processes of replication 
and transcription, the double helical structure of the DNA is opened 
up and a copy is made based on the specificity of the base pairing. 
Nucleotides are rejected if they do not have the correct base pairing 
(i.e., are unmatched). Even though these copying mechanisms proceed 
with high fidelity, mistakes are sometimes made through the insertion of 
a non-matching nucleotide, thus creating a mutation. Random mutation 
is one of the foundations for evolution, since it can introduce variations 
in the genetic code over time. 

The genome of an organism is made up of one or more long molecules 
of DNA that are organized into chromosomes. A chromosome consists 
of an uninterrupted length of double-stranded DNA that contains many
genes. For most bacteria and fungi, the cells contain only one copy of 
the genetic material, and these organisms are called haploid. In higher 
organisms, two copies of each chromosome and its component genes are 
present; these organisms are called diploid. Human cells contain two sets 
of 23 chromosomes, for a total of 46. During reproduction 23 chromo- 
somes from the father and 23 from the mother are combined and mixed 
to make a new set of 46 chromosomes in the progeny. The chromosome 
pairs in the diploid cells may not be identical and may contain variants 
of the same gene. The variations have been created by random mutation 
over time. A mutation consists of a change in the sequence of base pairs 
(bp) in DNA. A mutation in a coding sequence may change the sequence 
of amino acids in the protein. Gene variants like this that occupy the 
same position or locus on a chromosome are called alleles. A gene may 
have multiple alleles. This mixing of gene variants explains how traits 
like eye color can be inherited from either the mother or the father. Two 
chromosomes with the same gene components are said to be homologous. 

The unit of replication is the chromosome. When a cell divides, all the 
chromosomes are replicated. When a chromosome is replicated, all its 
genes are replicated. In addition to containing information about a pro- 
tein component or function, genes also contain regulatory elements that 
determine when and where the gene in question needs to be transcribed 
and translated. 

Figure 2.1. The Double-helix Structure of DNA. Source: The Cell - A Molecular
Approach, by Cooper, G.M., Copyright 2000, Sinauer Associates Inc., Sunderland,
Massachusetts. Reprinted with permission from Sinauer Associates Inc.

2.4. RNA

As was introduced in section 2.1, the core biochemical flow of genetic 
information can be summarized as the process of RNA synthesis (tran- 
scription) and the process of protein synthesis (translation). The first 
step in making a protein is to copy, or transcribe, the information en- 
coded in the DNA of the genes into a single-stranded molecule called 
ribonucleic acid (RNA). Since this process is similar to the process of 
copying written words, the synthesis of RNA from DNA is called tran- 
scription. The DNA is said to be transcribed into RNA, and the RNA 
is called a transcript. The nucleotides of RNA contain the sugar ribose, 
while the nucleotides of DNA contain deoxyribose, which has one fewer
oxygen atom. Furthermore, instead of thymine, RNA contains uracil (U), a
base that has hydrogen-bonding properties identical to those of thymine. 
Hence the RNA bases are G, C, A, and U. RNA is less stable than DNA. 
RNA synthesis requires the RNA polymerase enzyme complex that binds 
to a specific sequence at one end of a gene (the promoter) and separates
the two strands of DNA. It moves along the gene, maintaining the sep- 
arated strand “bubble”, and uses only one of the separated strands as 
a template, synthesizing an ever-growing tail of polymerized nucleotides 
that eventually becomes the full-length transcript. Hence, RNA is a 
single-stranded nucleotide chain, not a double helix. Since RNA is
always synthesized in the 5' → 3' direction, the addition of ribonucleotides
by RNA polymerase is at the 3' end of the growing chain.

There are two general classes of RNAs. Those that take part in the
process of decoding genes into proteins are referred to as “informational
RNAs”; these are the messenger RNAs (mRNA). In the other class, the RNA
itself is the final functional product. These RNAs are referred to as
“functional RNAs”. Functional RNAs include the transfer RNAs (tRNA)
and the ribosomal RNAs (rRNA), which are both part of the intricate
protein synthesis machinery that translates the informational mRNA
into protein.

Figure 2.2 shows that the sequence of messenger RNA is complemen- 
tary to the sequence of the bottom strand of DNA and is identical to 
the top strand of DNA, except for the replacement of T with U. A mes- 
senger RNA includes a sequence of nucleotides that corresponds to the 
sequence of amino acids in the protein. This part of the nucleic acid is 
called the coding region. Because mRNA is an exact copy of the DNA 
coding regions, mRNA analysis can be used to identify polymorphisms 
in coding regions of DNA. A polymorphism is a DNA region for which 
nucleotide sequence variants exist in a population of organisms. Such 
variations can sometimes explain the occurrence of a disease or enzyme 
deficiency within a population. Hence, a considerable effort has been
put into trying to identify such variations. Microarray technology can be 
used both in the identification of polymorphisms and in the diagnosis of 
polymorphism-related disease. Organisms whose cells have a membrane- 
bound nucleus are called eukaryotes. For example, animals and plants 
are eukaryotes. In eukaryotic cells, the initial pre-mRNA transcription 
product can be many times longer than needed for translation into pro- 
tein. At the 5' end of a eukaryotic gene, there is a regulatory region to 
which various proteins bind, causing the gene to be transcribed at the 
right time and in the right amount. A region at the 3' end of the gene 
contains a sequence encoding the termination of transcription. In the 
genes of many eukaryotes, the protein-encoding sequence is interrupted 
by varying numbers of segments called introns. The coding sequence 
segments interrupted by the introns are called exons. Introns are re- 
moved in the splicing process to generate the final mature mRNA ready 
to be translated by the protein synthesis machinery. 



An example of DNA with two base-paired strands
5' ATGCGGACCTGACATGCCGTTAGAGAC 3'
3' TACGCCTGGACTGTACGGCAATCTCTG 5'

RNA is synthesized from one strand of DNA
in the 5' to 3' direction

5' AUGCGGACCUGACAUGCCGUUAGAGAC 3'



Figure 2.2. RNA is synthesized in the 5' to 3' direction, using the bottom
strand of DNA as a template for complementary base pairing
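
The relationship between the DNA strands of Figure 2.2 and the transcript can be checked with a short Python sketch. It assumes, as stated in the text, that the mRNA is complementary to the bottom strand and identical to the top strand except that U replaces T. The names DNA_TO_RNA and transcribe are hypothetical helpers, not part of any library.

```python
# Pairing used when RNA is built on a DNA template:
# template A -> U (uracil replaces thymine), T -> A, G -> C, C -> G.
DNA_TO_RNA = {"A": "U", "T": "A", "G": "C", "C": "G"}

def transcribe(template_3_to_5: str) -> str:
    """Build the mRNA (5' -> 3') from a template strand written 3' -> 5'."""
    return "".join(DNA_TO_RNA[base] for base in template_3_to_5)

top_strand = "ATGCGGACCTGACATGCCGTTAGAGAC"     # written 5' -> 3'
bottom_strand = "TACGCCTGGACTGTACGGCAATCTCTG"  # written 3' -> 5', used as template

mrna = transcribe(bottom_strand)
print(mrna)  # AUGCGGACCUGACAUGCCGUUAGAGAC

# The transcript equals the top strand with every T replaced by U.
assert mrna == top_strand.replace("T", "U")
```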



2.5. The Genetic Code 

The sequence of nucleotides in DNA is important not because of its 
structure, but because it codes for the sequence of amino acids that 
dictate the structure of a protein with a defined function, be it struc- 
tural or catalytic. The relationship between a sequence of DNA and the 
sequence of the corresponding protein is called the genetic code. The 
genetic code is read in groups of three nucleotides, or codons, each of 
which represents one amino acid. Because each position in the
three-nucleotide codon could be one of the four bases A, C, G, and T, there
are a total of 4 × 4 × 4 = 64 possible different codons, each representing
an amino acid or a signal to terminate translation. As there are only 
20 common amino acids, several different codons can code for the same 
amino acid (the genetic code is said to be degenerate due to this many- 
to-one relationship). Since the genetic code is read in non-overlapping 
triplets, there are three possible ways of translating any nucleotide se- 
quence into a protein, depending on the starting point. These are called 
reading frames. A reading frame that starts with a special initiation 
codon (AUG, methionine) and extends through a series of codons
representing amino acids until it ends at one of three termination codons
(UAA, UAG, UGA) can potentially be translated into a protein and is
called an open reading frame (ORF). A long open reading frame is un- 
likely to exist by chance. The identification of a lengthy open reading 
frame is strong evidence that the sequence is translated into protein in 
that frame. An open reading frame for which no protein product has 
been identified is sometimes called an unidentified reading frame (URF). 
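
The idea of scanning the three reading frames for open reading frames can be sketched as follows. This is a simplified illustration, not a gene-finding method: it works on an mRNA string, looks only at the three forward frames, and reports the first AUG before each stop codon. The function name find_orfs and the min_codons parameter are assumptions made for this example.

```python
START_CODON = "AUG"
STOP_CODONS = {"UAA", "UAG", "UGA"}

def find_orfs(mrna: str, min_codons: int = 1):
    """Return (frame, start, end) for each AUG...stop run in the three forward frames.

    start is the index of the A in AUG; end is the index just past the stop codon.
    min_codons is the minimum number of amino-acid codons (stop codon excluded).
    """
    orfs = []
    for frame in range(3):
        start = None
        # Step through the sequence in non-overlapping triplets.
        for pos in range(frame, len(mrna) - 2, 3):
            codon = mrna[pos:pos + 3]
            if start is None and codon == START_CODON:
                start = pos
            elif start is not None and codon in STOP_CODONS:
                n_codons = (pos - start) // 3
                if n_codons >= min_codons:
                    orfs.append((frame, start, pos + 3))
                start = None
    return orfs

# The transcript of Figure 2.2 has one short ORF in frame 0: AUG CGG ACC UGA.
print(find_orfs("AUGCGGACCUGACAUGCCGUUAGAGAC"))  # [(0, 0, 12)]
```

Raising min_codons reflects the remark above that a long open reading frame is unlikely to exist by chance, so short hits such as this one carry little evidence on their own.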



2.6. Proteins 

The primary structure of a protein is a linear chain of building blocks 
called amino acids. There are 20 amino acids that commonly occur in 
proteins. These amino acids are linked together by covalent bonds called 
peptide bonds. A peptide bond is formed through a condensation reaction 
during which one water molecule is removed. Because of the manner in 
which the peptide bond forms, a polypeptide chain always has an amino 
(NH2) end and a carboxyl (COOH) end. This primary chain is coiled
and folded to form a functional protein. Proteins are the most important 
determinants of the properties of the cells and organisms. The biological 
role of most genes is to encode, or carry, information for the composition 
of proteins. This composition, together with the timing and amount of 
each protein produced, determines the structure and physiology of an 
organism, i.e., the phenotype. 

Because the process of reading the mRNA sequence and converting it 
into an amino acid sequence is like converting one language into another, 
the process of protein synthesis is called translation. The four-letter 
alphabet of the genes is translated into the 20-amino-acid alphabet of 
proteins in ribosomes. Ribosomes are big complexes of several proteins 
and ribosomal RNA (rRNA). The rRNA functions to guide mRNA into 
a correct starting position by binding to special sequences present in 
the beginning of all mRNAs. The translation of the genetic code into a 
protein is achieved with the help of transfer RNAs (tRNAs). The tRNAs
contain a trinucleotide sequence complementary to the codon called the 
anticodon. Each species of tRNA molecules is charged with a specific 
amino acid in an enzymatic reaction, hence coupling a certain amino 
acid to a certain anticodon nucleotide triplet on the tRNA molecule. 
In essence, the translation of the DNA code to protein amino acids is 
done in this enzymatic coupling step. The ribosome subsequently aligns 
the mRNA codon with the matching tRNA anticodon, and if the base 
pairing matches, the amino acid carried by the tRNA is attached to 
the growing chain of amino acids to form a polypeptide chain. Hence, 
the specific base pairing of the nucleotides once again ensures that the 
correct information is transferred. When the ribosome reaches a stop 
codon, it releases the polypeptide chain, which then folds into the defined 
three-dimensional structure of a protein. Proteins must often undergo 
post-translational modifications to become active. These modifications 
can, for instance, be cleavages of the polypeptide chain at predefined 
sites or binding of additional molecules like lipids, sugars, or co-factors 
that assist in catalysis of chemical reactions. 
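
The codon-by-codon reading of an mRNA can be sketched in Python using the standard genetic code. The sketch is a deliberate simplification: it ignores tRNA charging, ribosome binding, and post-translational modification, starts at the first AUG it finds, and writes amino acids in the usual one-letter codes. The names CODON_TABLE and translate are hypothetical.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, listed with each codon position cycling through U, C, A, G.
# "*" marks a termination codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}

def translate(mrna: str) -> str:
    """Translate from the first AUG to the first in-frame stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for pos in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break  # stop codon: the polypeptide chain is released
        protein.append(amino_acid)
    return "".join(protein)

print(len(CODON_TABLE))                          # 64 codons
print(len(set(AMINO_ACIDS) - {"*"}))             # 20 amino acids
print(translate("AUGCGGACCUGACAUGCCGUUAGAGAC"))  # MRT
```

The table has 64 entries but only 20 distinct amino acids plus the stop signal, which is the degeneracy of the genetic code described in Section 2.5.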

2.7. Gene Expression and Microarrays 

Gene expression is the process by which mRNA, and eventually pro- 
tein, is synthesized from the DNA template of each gene. The first stage 
of this process is transcription, when an RNA copy of one strand of the 
DNA is produced. In eukaryotes it is followed by RNA splicing, during 
which the introns are cut out of the primary transcript and a mature 
mRNA is made. As part of the maturation process, a tail of adenine 
nucleotides is added to the 3' end of the mRNA. This poly(A) tail can
vary greatly in length and is believed to stabilize the mRNA molecule. 
Transcription and splicing of RNA occur in the nucleus. The next stage 
of gene expression is the translation of the mRNA into protein. This 
occurs in the cytoplasm. In the process of gene expression, RNA
provides not only the essential substrate (mRNA) but also components of
the protein synthesis apparatus (tRNA, rRNA).

Some protein-encoding genes are transcribed more or less constantly; 
they are sometimes called housekeeping genes and are always needed for 
basic reactions. Other genes may be rendered unreadable or, to suit 
the functions of the organism, readable only at particular moments and 
under particular external conditions. The signal that masks or unmasks 
a gene may come from outside the cell; for example, from a nutrient or 
a hormone. Special regulatory sequences in the DNA dictate whether a 
gene will respond to the signals, and they in turn affect the transcription 
of the protein-encoding gene. Understanding which genes are expressed
under which condition gives invaluable information about the biological 
processes in the cell. The power of microarray technology lies in its 
ability to measure the expression of thousands of genes simultaneously. 

2.8. Complementary DNA (cDNA) 

Complementary DNA (cDNA) is used in recombinant DNA technol- 
ogy. cDNA is complementary to a given mRNA and is usually made by 
the enzyme reverse transcriptase, first discovered in retroviruses. Re- 
verse transcription allows a mature mRNA to be retrieved as cDNA 
without the interruption of non-coding introns. The coexistence of 
mRNA and cDNA establishes the general principle that information in 
the form of either type of nucleic acid sequence can be converted into 
the other type. In microarray technology the process of reverse tran- 
scription is frequently used to incorporate fluorescent dyes into cDNA 
complementary to the mRNA transcripts. 



RNA --(reverse transcription)--> cDNA
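
At the sequence level, reverse transcription amounts to writing the DNA strand that is complementary to the mRNA. A minimal Python sketch, with the hypothetical helper name first_strand_cdna and no modeling of the enzyme or of dye incorporation, is:

```python
# RNA base -> complementary DNA base (U in the RNA pairs with A in the cDNA).
RNA_TO_DNA = {"A": "T", "U": "A", "G": "C", "C": "G"}

def first_strand_cdna(mrna_5_to_3: str) -> str:
    """Return the cDNA strand complementary to the mRNA, written 5' -> 3'."""
    # Reversing first gives the result in the conventional 5' -> 3' orientation.
    return "".join(RNA_TO_DNA[base] for base in reversed(mrna_5_to_3))

print(first_strand_cdna("AUGCGGACC"))  # GGTCCGCAT
```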



2.9. Nucleic Acid Hybridization 

The specific base pairing of nucleic acids is the foundation of mi- 
croarray technology. The specific pairing of an artificial DNA sequence 
probe with its biological counterpart allows for exact identification of 
the sought-after unique sequence or gene. 

Because of the base-pairing arrangements, the two strands of DNA can
separate and re-form very quickly under physiological conditions that 
disrupt the hydrogen bonds between the bases but are much too mild to 
pose any threat to the covalent bonds in the backbone of the DNA. The 
process of strand separation is called denaturation or melting. Because of 
the complementarity of the base pairs, the two separated complementary 
strands can be re-formed into a double helix (the two strands are then 
said to be annealed). This process is called renaturation. The technique 
of renaturation can be extended to allow any two complementary nucleic 
acid sequences to anneal with each other to form a duplex structure. 

Hybridization is the biochemical method on which DNA microarray 
technology is based. Nucleic acid sequences can be compared in terms 
of complementarity that is determined by the rules for base pairing. In 
a perfect duplex of DNA, the strands are precisely complementary. It is
possible to measure complementarity because the denaturation of DNA 
is reversible under appropriate conditions. Detecting and identifying 
nucleic acid (DNA, mRNA) with a labeled cDNA probe that is com- 
plementary to it is an application of nucleic acid hybridization. DNA 
microarrays utilize hybridization reactions between single-stranded fluo- 
rescent dye-labeled nucleic acids to be interrogated and single-stranded 
sequences immobilized on the chip surface. The next chapter will discuss 
the microarray technology in detail. 
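
Complementarity between an immobilized sequence and a labeled target can be expressed as a count of mismatched positions. The Python sketch below is an idealization: it assumes two DNA strands of equal length, aligned antiparallel, and says nothing about temperature or other experimental conditions. The name count_mismatches is hypothetical.

```python
PAIRING = {"A": "T", "T": "A", "G": "C", "C": "G"}

def count_mismatches(probe_5_to_3: str, target_5_to_3: str) -> int:
    """Count non-complementary positions when two strands are aligned antiparallel."""
    if len(probe_5_to_3) != len(target_5_to_3):
        raise ValueError("strands must have equal length")
    # The target is reversed so that its 3' end faces the 5' end of the probe.
    return sum(
        PAIRING[p] != t
        for p, t in zip(probe_5_to_3, reversed(target_5_to_3))
    )

probe = "ATGCCCTGAC"
print(count_mismatches(probe, "GTCAGGGCAT"))  # 0: a perfect duplex
print(count_mismatches(probe, "GTCAGGGCAA"))  # 1: one non-matching base
```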

Chapter 3



MICROARRAY TECHNOLOGY



This chapter is contributed by Harry Bjorkbacka, Ph.D. 

Lipid Metabolism Unit, Massachusetts General Hospital, Boston 

Microarray technology allows measurement of the levels of thousands 
of different RNA molecules at a given point in the life of an organism, 
tissue, or cell. Comparisons of the levels of RNA molecules can be used 
to decipher the thousands of processes going on simultaneously in living 
organisms. Also, comparing healthy and diseased cells can yield vital in- 
formation on the causes of diseases. Microarrays have been successfully 
applied to several biological problems and, as arrays become more easily 
available to researchers, the popularity of these kinds of experiments will 
increase. The demand for good statistical analysis regimens and tools 
tailored for microarray data analysis will increase as the popularity of 
microarrays grows. The future will likely bring many new microarray 
applications, each with its own demands for specialized statistical analy- 
sis. 

In order to analyze any experimental data correctly, it is fundamen- 
tal to understand the experiments that generated the data. Microarray 
experiments contain many steps, each with its individual noise and vari- 
ation. The final result may be affected by any of the steps in the process. 
Good experimental design and careful statistical analysis are required for 
successful interpretation of microarray data. This chapter will review 
the most commonly used microarray technology platforms, pointing out 
their strengths and weaknesses. More in-depth descriptions of some spe- 
cific microarray protocols can be found in pioneering research articles by 
Shalon et al. 1, Lockhart et al. 2, Lander 3, Lipschutz et al. 4, Brown and
Botstein 5, Eisen and Brown 6, Southern et al. 7, Bowtell 8, Cheung et al. 9,
and books by Schena 10, Baldi and Hatfield 11, Bowtell and Sambrook 12,
and Grigorenko 13, among others.

3.1. Transcriptional Profiling 

With today’s advances in microarray technology, large sets of gene 
expression data can be created. Such catalogues are called gene expres- 
sion or transcriptional profiles, and the process of gathering the data is 
called profiling. Transcriptional profiling can be either sequencing- or 
hybridization-based. 

3.1.1. Sequencing-based Transcriptional Profiling 

Sequencing-based approaches include sequencing of complementary 
DNA (cDNA) libraries and serial analysis of gene expression (SAGE). 
Libraries of cDNA are created by reverse transcription of mRNAs ex- 
pressed in a tissue or a cell type under some treatment or condition. 
Individual cDNA clones are created by recombinant DNA technology. 
Sequencing the cDNA reveals the identity of the clone either as a known 
sequence in a public database or as a novel unknown sequence. The num- 
ber of clones with the same unique sequence is related to the expression 
levels in the mRNA pool used to create the library. 

The basic idea of SAGE is to generate short cDNA “sequence tags”
from a pool of mRNA, join them together, and sequence several tags
at a time. A sequence tag is a stretch of 10-14 base pairs (bp) that
contains sufficient information to uniquely identify a transcript, provided 
that the tag is obtained from a unique position within each transcript. 
The tags are created by special restriction enzymes that cut DNA at 
specific sequences. First, the cDNAs are cut with an enzyme at a short 
recognition sequence, and then with another enzyme that cuts about 20 
base pairs away from the same recognition sequence. The tags are also 
created such that they can be cloned and sequenced easily. The number 
of times a particular tag is observed in the sequencing data indicates the 
expression level of the corresponding transcript. An overview of SAGE 
is given in Figure 3.1. 
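
The counting step can be illustrated with a minimal Python sketch. It is not the SAGE software itself: the anchoring-enzyme site (CATG), the 10-bp tag length, the function names, and the toy cDNA pool are all assumptions made for illustration.

```python
from collections import Counter

ANCHOR_SITE = "CATG"   # assumed anchoring-enzyme recognition site
TAG_LENGTH = 10        # assumed tag length in base pairs

def extract_tag(cdna):
    """Return the tag just downstream of the 3'-most anchoring site, or None."""
    pos = cdna.rfind(ANCHOR_SITE)
    if pos == -1:
        return None
    start = pos + len(ANCHOR_SITE)
    tag = cdna[start:start + TAG_LENGTH]
    return tag if len(tag) == TAG_LENGTH else None

def count_tags(cdnas):
    """Count how often each tag occurs in a pool of cDNA sequences."""
    tags = (extract_tag(seq) for seq in cdnas)
    return Counter(tag for tag in tags if tag is not None)

# Hypothetical pool: two copies of one transcript, one copy of another.
pool = [
    "GGACATGTTTGACCTAGGCAAAAA",
    "GGACATGTTTGACCTAGGCAAAAA",
    "TTCATGACGTACGTTAGCAAAAAA",
]
counts = count_tags(pool)
```

In this toy pool the first transcript contributes its tag twice and the second once, so the tag counts mirror the relative abundance of the two messages.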

Biotin-labeled double-stranded cDNA is cleaved with a restriction en- 
donuclease (anchoring enzyme). The 3' ends of the resulting cDNA frag- 
ments are then purified using streptavidin-coated magnetic beads, and 
the resulting cDNA fragments are divided into two populations, each of 
which is ligated to a different linker (1 or 2 in the figure) containing a 
type-IIS restriction endonuclease (tagging enzyme) recognition sequence. 
Such enzymes cleave DNA up to 20 bp away from their recognition site.
Digestion of the two cDNA populations thus results in the generation of 
a short sequence consisting of the linker and a short portion of its adja- 
cent cDNA. Following the creation of blunt ends, the two populations are 
ligated to each other and total cDNA is amplified by polymerase chain 
reaction (PCR), resulting in the generation of products with two tags (a 
ditag) orientated tail-to-tail, with an anchoring enzyme recognition site 
at either end. Following cleavage at each anchoring enzyme recognition 
sequence and concatenation of ditags via this site, products are cloned
and individual clones consisting of at least 25 tags (25-75) are selected 
for sequencing. 



[Figure 3.1 panel labels: cDNA primed with biotin-oligo(dT); restriction digest and binding to streptavidin beads.]

Figure 3.1. Sequencing-based transcriptional profiling and serial analysis of gene
expression (SAGE)
3.1.2. Hybridization-based Transcriptional Profiling



Microarray technology has evolved from Ed Southern’s insight that 
labeled nucleic acid molecules could be used to identify nucleic acid 
molecules attached to a solid support. Hybridization methods, such 
as Southern and Northern blots, colony hybridizations, and dot blots, 
have long been used to identify and quantify nucleic acids in biological 
samples. These methods traditionally attempt to identify and measure 
only one gene or transcript at a time. 



Hybridization methods have evolved from these early membrane-based, 
radioactive detection embodiments to highly parallel quantitative meth- 
ods using fluorescence detection. Some key innovations have made it 
possible to develop techniques that analyze hundreds or thousands of 
hybridizations in parallel. The first was the use of non-porous solid 
supports, such as glass slides, which facilitate miniaturization.
The second was the development of methods for spatial synthesis
and robotic spotting of oligonucleotides and cDNAs on a very small
scale. These methods have made it possible to generate arrays with very 
high densities of DNA, allowing tens of thousands of genes to be repre- 
sented in areas smaller than standard glass microscope slides. In fact, 
today it is technically possible to generate arrays of probes represent- 
ing all the genes of a genome on a single slide. Finally, improvements 
in fluorescent labeling of nucleic acids, fluorescent-based detection, and 
image processing have improved the accuracy of microarrays. 



Before describing the process of generating and using microarrays in 
more detail, a clarification of the nomenclature is needed. At least two 
nomenclature systems currently exist in the literature for referring to 
DNA hybridization partners. There is no general consensus on the usage 
of the terms probe and target, and researchers have used these two terms 
interchangeably in a number of publications. With respect to the nucleic 
acids whose entwining represents the hybridization reaction, the identity 
of one is defined as it is tethered to the solid phase, making up the 
microarray itself. The identity of the other is revealed by hybridization. 
Nature Genetics 14 and Duggan et al. 15 adopted the nomenclature that
the tethered nucleic acids spotted on the array are the probes, and the 
fluor-tagged cDNAs from a complex mRNA mixture extracted from cells 
are the targets. This book will follow this nomenclature. 
3.2. Microarray Technological Platforms

Microarrays allow large numbers of DNA clones with known sequences 
to be immobilized as an array of detection units (probes), while the pool 
of RNAs to be examined (targets) is fluorescently labeled and then
hybridized to the detectors. There are three main microarray technological
platforms, namely spotted cDNA arrays, spotted oligonucleotide arrays,
and in-situ oligonucleotide arrays (e.g., Affymetrix GeneChip arrays).
The differences between these three platforms lie in the way the arrays 
are produced and the types of probes used (see Figure 3.2). 

PROBE:             ARRAYING TECHNIQUE:    MICROARRAY PLATFORM:

cDNA               Robotic spotting       Spotted cDNA microarrays
Oligonucleotide    Robotic spotting       Spotted oligonucleotide microarrays
Oligonucleotide    In situ synthesis      In situ oligonucleotide microarrays





Figure 3.2. The three common microarray platforms are distinguished by the probe 
type and arraying technique used in manufacturing. 

In spotted cDNA arrays, Figure 3.3, full-length cDNA clones or ex- 
pressed sequence tag (EST) libraries are robotically spotted and immo- 
bilized on the support. Many laboratories already have cDNA libraries, 
so generation of these arrays requires only investment in the robotic 
equipment to spot, or array, the cDNA. Spotted cDNA arrays have an 
advantage over other types of arrays in that unknown sequences can be 
spotted. Thus, for organisms for which no or only limited genome se- 
quence information is available, spotted cDNA microarrays are the only 
choice for genome-wide transcriptional profiling.

Spotted oligonucleotide arrays, Figure 3.4, are very similar to spotted 
cDNA arrays, except that synthetic oligonucleotides (abbreviated oligos) 
instead of cDNA are used as probes. In fact, the same robotics can be 
used to manufacture both types of arrays. When sequence information 
is available, oligonucleotide probes of 20-70 nucleotides can be designed 
and synthesized. Using designed oligonucleotide probes gives better
control over what part of the gene will be utilized for hybridization. The
oligonucleotides can, for instance, be designed to optimally differentiate 
between highly similar transcripts that might cross-hybridize on a cDNA 
array. 

In-situ oligonucleotide arrays, Figure 3.5, were developed by Fodor 
et al. 16 and Affymetrix, Inc. In-situ oligonucleotide arrays use a combi- 
nation of photolithography and solid-phase oligonucleotide chemistry to 
synthesize short oligonucleotide probes (25-mer oligos) directly on the 
solid support surface. The number of oligonucleotides (50,000 probes 
per 1.28 square centimeters) on a chip manufactured by this method 
vastly exceeds what can be achieved by spotting solution robotically. 
Affymetrix Inc. has chosen to utilize this advantage to construct an 
array with several oligonucleotide probes and cross-hybridization con- 
trols for each target gene. However, the researcher has little, if any, 
control over what probes are used on pre-manufactured arrays like the
Affymetrix GeneChip arrays. On the other hand, comparison of results 
between different laboratories is facilitated by the use of products from 
a common manufacturer. 

For in-situ oligonucleotide arrays, the test and reference samples (or 
the treatment and control samples) are hybridized separately on dif- 
ferent chips. In contrast, for either spotted cDNA arrays or spotted 
oligonucleotide arrays, a test and a reference sample labeled with two 
different fluorescent dyes are commonly simultaneously hybridized on 
the same arrays. This difference affects how microarray data generated 
with single-color or two-color arrays are analyzed (see section 3.8). 



3.3. Probe Selection and Synthesis 

For large-scale gene expression studies, the first step is to select and 
prepare the specific hybridization detectors (probes). 

cDNA microarray technology provides great flexibility in the choice 
and production of ordered arrays, since the probes can be created from 
cDNA libraries. A cDNA library is a set of plasmid vectors with inserted 
mRNA segments turned into cDNA by reverse transcription, usually 
harbored in bacterial clones. Plasmids are independently replicating 
small extrachromosomal DNA molecules. The probe to be used on a 
cDNA array can be amplified from such a cDNA library by PCR (Fig- 
ure 3.3). PCR allows amplification of a specific segment of template 
DNA between two sequence-specific hybridized oligonucleotide primers. 
Many laboratories have already created cDNA libraries for a wide vari- 
ety of organisms, cells, and tissues. For organisms whose genomes have 
been fully sequenced, one can amplify every known and predicted open
reading frame (ORF) in the genome using reverse transcription PCR 
(RT-PCR) and sequence-specific primers. In organisms with smaller 
genomes and infrequent introns, such as yeast and prokaryotic microbes, 
purified total genomic DNA serves as a template, and sequence-specific 
oligonucleotides are used as primers. 

When dealing with large genomes and genes with frequent introns, 
such as those of the human and mouse, cloned expressed sequence tags 
(EST), individual full-length cDNA clones, or collections of partially se- 
quenced cDNAs corresponding to each of these transcripts can be used 
as the source of gene-specific detector probes in an array. Many methods 
are available for recovering purified cDNA from the PCR amplification 
reaction. A simple method is to prepare purified template cDNAs from 
the bacterial colonies that harbor them and follow up with ethanol
precipitation, gel filtration, or both, to prepare relatively pure cDNA for
printing. The choice of template source and PCR strategy varies with the
organism being studied. 

Synthesized oligonucleotides can also be used as probes in spotted 
microarrays (Figure 3.4). Genes of interest are chosen from public
sequence databases including GenBank, dbEST, and UniGene. Many
variables have to be considered in selecting the sequence of the oligonu- 
cleotide to be made. First, the length of the oligonucleotide has to be 
chosen. The longer the oligonucleotide, the more specific it will be. 
However, longer oligonucleotides are more costly and more difficult to 
make. Today several commercial oligonucleotide sets are available for 
mouse, human, and other organisms, varying in probe length between 
30 and 70 nucleotides. Second, the probes must be selected so that they 
are specific for their target genes. If similarities exist between probes 
on the same microarray, they can cross-hybridize to more than one gene 
target, making the results hard to analyze. Third, all oligonucleotides 
must have similar hybridization properties. Usually all the probes are
designed so that their melting temperatures fall within 1-2 degrees
Celsius of one another and their G+C content is similar.
Several probe selection algorithms have been developed, but so
far no consensus exists on the most effective design principles. For in- 
stance, there are several algorithms just for calculating oligonucleotide 
melting temperatures. Other considerations that may go into designing 
the probes are self-hybridization properties (palindromic sequences) and 
synthesis efficiency of certain sequences. The location of the probe along 
the message may also be important. 
The probe can be selected to hybridize to the 3'-untranslated region
(UTR) or to the 3' or 5' coding sequences. Another matter to consider is the
complementarity of the probes. Before designing oligonucleotide probes 
for spotted arrays, a labeling strategy must be chosen to assure that 
probes and targets will hybridize. The mRNA is the sense strand of 
the message, and when converting the mRNA to single-stranded labeled 
cDNA, the cDNA will be antisense and then only hybridize specifically 
to sense strand oligonucleotide probes (see section 3.5 on Target Label- 
ing). The probes in spotted cDNA arrays are double-stranded and can 
hybridize either to a labeled anti-sense or to a sense strand target. 
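
Several of the design checks described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not one of the published probe selection algorithms: the melting temperature uses a common length-and-G+C approximation (one of the several competing formulas mentioned above), and the function names and the example message are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def gc_fraction(seq):
    """Fraction of G and C bases in the sequence."""
    return (seq.count("G") + seq.count("C")) / len(seq)

def melting_temperature(seq):
    """Approximate Tm in degrees Celsius from length and G+C count (assumed formula)."""
    gc = seq.count("G") + seq.count("C")
    return 64.9 + 41.0 * (gc - 16.4) / len(seq)

def is_self_complementary(seq):
    """True if the probe is palindromic and could hybridize to itself."""
    return seq == reverse_complement(seq)

def tm_spread(probes):
    """Difference between the highest and lowest Tm in a probe set."""
    tms = [melting_temperature(p) for p in probes]
    return max(tms) - min(tms)

# Hypothetical message written in DNA letters (sense strand).
message = "ATGGCCATTGTAATGGGCCGCTGAAAGGGTGCCCGATAG"
probe = message[5:35]                        # sense-strand 30-mer probe
labeled_cdna = reverse_complement(message)   # antisense labeled target
hybridizes = reverse_complement(probe) in labeled_cdna
```

A candidate probe set would be accepted only if `tm_spread` stays within the chosen tolerance (1-2 degrees in the text) and no probe is self-complementary. The last three lines illustrate the strand argument: an antisense labeled cDNA contains the reverse complement of any sense-strand probe taken from the same message.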

The probes for in-situ oligonucleotide arrays are synthesized directly 
on the surface of the support. The probes are 25 nucleotides long and are
organized as perfect-match versus mismatch pairs, with the mismatch 
probe acting as a control for hybridization specificity. Perfect-match 
(PM) probes are designed to be complementary to the target sequence. 
Mismatch (MM) probes are designed to be complementary to the target 
sequence except for a single base mismatch at the central position,
equidistant from both ends of the probe. Mismatch probes serve as
controls for cross-hybridization. A probe cell is a single square feature 
on an array containing a PM or MM probe. The size of the feature can 
vary depending on the array type. Each probe cell contains millions of 
probe molecules. A probe pair consists of two probe cells: a PM and its
corresponding MM. On the array, a probe pair is arranged with the PM 
cell directly adjacent to the MM cell. When the probes are designed, 
several probe pairs are selected to represent each transcript. A probe set 
designed to detect one transcript usually consists of 16-20 probe pairs. 
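
The relationship between a PM probe and its MM partner can be made concrete with a short Python sketch. It assumes that the central base is replaced by its complement, which is one way of introducing the single mismatch; the function names and the example sequence are hypothetical.

```python
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def mismatch_probe(pm):
    """Return the MM partner of a PM probe by changing its central base."""
    if len(pm) % 2 == 0:
        raise ValueError("probe length must be odd to have a central base")
    mid = len(pm) // 2
    # Assumption: the mismatch is the complement of the central base.
    return pm[:mid] + COMPLEMENT[pm[mid]] + pm[mid + 1:]

def probe_set(pm_probes):
    """Pair each PM probe with its MM partner."""
    return [(pm, mismatch_probe(pm)) for pm in pm_probes]

pm = "ACGT" * 6 + "A"          # hypothetical 25-mer
mm = mismatch_probe(pm)
differences = [i for i, (a, b) in enumerate(zip(pm, mm)) if a != b]
```

For a 25-mer the only difference lies at index 12, that is, nucleotide number 13, which is equidistant from both ends of the probe.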



DNA chips may contain 100,000 different oligos in a 4-cm² area, and
each probe cell has approximately 10⁷ oligo molecules. By designing
oligos that span an entire exon using a register of one nucleotide change 
between adjacent probe cells and a window of 25 nucleotides at a time, it 
is possible to utilize DNA chip technology for DNA resequencing. In this 
resequencing strategy, the midpoint nucleotide (number 13 in a 25-mer) 
is synthesized as a G, C, A, or T. Using PCR products and hybridiza- 
tion conditions that discriminate between perfect and single base-pair 
mismatch duplexes, it is possible to read the sequence across the target 
DNA based on the most intense signal in each set of four oligo probes. 
Affymetrix is also producing in-situ oligonucleotide arrays that contain 
probes corresponding to every known single nucleotide polymorphism 
(SNP) in the human genome 17 . A SNP is a single base-pair site within 
the genome at which more than one of the four possible base pairs is 
commonly found in natural populations. SNPs can be inherited markers
of a disease. 
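
The resequencing strategy described above can be sketched as follows. This is a simplified illustration: probes are written in the same orientation as the reference rather than as its complement, the intensities are invented, and the function names are hypothetical.

```python
BASES = "ACGT"

def tiling_probes(reference, window=25):
    """For each window position, build four probes that differ only at the midpoint."""
    mid = window // 2
    tiles = []
    for start in range(len(reference) - window + 1):
        segment = reference[start:start + window]
        tiles.append({b: segment[:mid] + b + segment[mid + 1:] for b in BASES})
    return tiles

def call_bases(intensities):
    """Pick, for each set of four probes, the base with the most intense signal."""
    return "".join(max(BASES, key=lambda b: four[b]) for four in intensities)

reference = "ACGT" * 6 + "ACG"          # hypothetical 27-nt stretch
tiles = tiling_probes(reference)        # one set of four probes per position

# Invented signals for two probe sets; the brightest probe gives the base call.
signals = [
    {"A": 10, "C": 200, "G": 12, "T": 8},
    {"A": 9, "C": 11, "G": 7, "T": 150},
]
called = call_bases(signals)
```

Shifting the window by one nucleotide at a time moves the interrogated midpoint along the target, so the string of base calls reads the sequence across the region covered by the tiles.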

3.4. Array Manufacturing 

The in-situ oligonucleotide arrays developed by Affymetrix, Inc. are 
made using photolithography, in which light is passed through holes in a 
mask (Figure 3.5). The light deprotects chemically reactive groups in
locations on the support specified by the holes in the mask. A nucleotide 
can subsequently be coupled to the activated group. By varying the 
location of the holes in the mask and the nucleotides coupled in each 
step, oligonucleotide probes can be synthesized in situ at a very high 
density. 
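
The logic of mask-directed synthesis can be illustrated with a toy simulation. It is only a sketch of the principle, not the manufacturer's process: nucleotides are assumed to be offered in a fixed repeating order, and in each step the mask exposes exactly those features whose next required base matches the nucleotide being coupled.

```python
def synthesize_in_situ(desired, cycle_order="ACGT"):
    """Simulate mask-directed synthesis; return (probes, number_of_steps)."""
    for probe in desired:
        if any(base not in cycle_order for base in probe):
            raise ValueError("probe contains a base that is never offered")
    grown = ["" for _ in desired]
    steps = 0
    while any(len(g) < len(d) for g, d in zip(grown, desired)):
        base = cycle_order[steps % len(cycle_order)]
        # The mask has a hole over each feature whose next base is `base`.
        mask = [len(g) < len(d) and d[len(g)] == base
                for g, d in zip(grown, desired)]
        # Only deprotected (exposed) features are extended in this step.
        grown = [g + base if exposed else g for g, exposed in zip(grown, mask)]
        steps += 1
    return grown, steps

features = ["ACG", "TTA", "GGC"]        # hypothetical three-feature array
probes, steps = synthesize_in_situ(features)
```

Because every feature is extended in parallel, the number of coupling steps depends on probe length (at most four steps per position in this scheme) rather than on the number of features on the array.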

In spotted cDNA arrays and spotted oligonucleotide arrays, the probe 
is deposited as a solution on the surface of the support and then attached 
(Figures 3.3 and 3.4). Several different choices of support materials 
are available, ranging from plastic polymers to glass. The surface of
the popular glass microscope slides is usually coated with chemicals
to reduce the background fluorescence and nonspecific binding of the 
labeled target. This can be done by the manufacturer or in-house before 
arraying the probes. Surface coatings can be relatively simple poly-lysine 
coatings or more complex 3-D molecular matrix layers. 

There are two basic methods for spotting or arraying cDNA and 
oligonucleotides onto the support. Contact spotting relies on pins to 
pick up solution by capillary action and deposit drops upon contact. The 
first cDNA microarrays were created in this fashion in Patrick Brown’s 
laboratory at Stanford University. Non-contact spotting is done by
variations of inkjet technology, in which small drops of solution can be
sprayed with high precision onto a surface. Inkjet printers have been 
used to spot both cDNAs and oligos (Packard Piezoelectric dispensing 
system). They also have been used to spot free nucleotides to synthe- 
size oligonucleotides in situ on the solid support (Agilent Technologies 
SurePrint Inkjet technology). After spotting, the probes have to be at- 
tached covalently to the surface of the support. The method of choice 
for cDNAs is ultra-violet (UV) light cross-linking, which forms attach- 
ments at random sites along the probe molecule. Oligonucleotides may 
also be UV cross-linked, or attached more specifically by a linker
molecule. Adding a linker molecule to the end of the oligonucleotide in 
the synthesis allows the oligonucleotides to be specifically attached at 
one end to the surface of the support material. The support material 
surface must have the appropriate reactive groups to allow attachment 
of oligonucleotides via a linker molecule. Since not all surface coatings
have reactive groups that can be used to couple oligonucleotides cova- 
lently, the choice of coating depends on what attachment method will be 
used. If linkers are used, the oligonucleotides must be allowed to react to 
form the covalent bonds, and the remaining unreacted reactive groups 
on the surface must be blocked before the slide is ready for use. 

3.5. Target Labeling 

A wide variety of target labeling methods are available today. Fig- 
ure 3.6 describes three commonly used labeling regimens. RNA can be 
labeled through reverse transcription (RT) that incorporates modified
nucleotides (nt), which are either directly tagged with a fluorescent
marker or carry a modification to which the dye is chemically attached
later. A very different strategy involves amplification of the RNA
message by running in vitro transcription (IVT) from a promoter
sequence incorporated into the cDNA in the RT reaction. Several cRNA
transcripts can be created from each cDNA in the IVT reaction,
amplifying the original message. Biotin-modified nucleotides
incorporated in the IVT step later serve as handles
for fluorescent dye tagging.

Most methods involve reverse transcription of the mRNA in some 
fashion. The choice of target labeling strategy for both spotted ar- 
rays and in-situ oligonucleotide arrays must ensure that the probes and 
the labeled targets are complementary. In the simplest labeling strat- 
egy, a fluorescent dye-modified nucleotide is incorporated in the reverse 
transcription of mRNA to cDNA. The reverse transcription needs to be 
primed by a hybridized short oligonucleotide primer. The primer can 
be either an oligo-dT (usually 12-18 nucleotides long) that anneals to 
the poly-A tail of eukaryotic mRNAs or a random primer (typically 6 
nucleotides long) that will initiate reverse transcription at random sites 
along the mRNA. The number of nucleotides modified by fluorescent 
groups must be balanced to avoid incorporating dye molecules too fre- 
quently (which will quench the fluorescence signal) yet frequently enough 
to provide a good signal. 
For both spotted cDNA arrays and spotted oligonucleotide arrays,
usually two samples (a test and a reference) are hybridized to each slide 
to compensate for slide differences in the spotting process (Figures 3.3 
and 3.4). The test and reference mRNA samples (for instance, a dis- 
eased and a healthy tissue sample) are reverse-transcribed and labeled 
with two different fluorescent dyes in separate test tubes and later pooled 
for simultaneous hybridization to the microarray slide probes. The most 
popular choices for fluorescent dyes are the Cy3 (green) and Cy5 (red) 
dyes. Several options exist, however, and systems have been developed 
to allow for detection of multiple dye-labeled hybridized targets 18 . The 
emission spectra of all the dyes used have to be sufficiently separated 
in wavelength to allow for individual detection. A drawback of using 
dye-modified nucleotides in the reverse transcription reaction is that dif- 
ferent dye molecules can be incorporated with different efficiency into the 
cDNA. To overcome this obstacle, labeling methods have been developed 
in which nucleotides modified with an aminoallyl group are incorporated 
into both the test and the reference sample in different test tubes. The 
dye can later be attached to the test and reference cDNA with aminoal- 
lyl groups in a chemical reaction that presumably is less sensitive to the 
structural bias of the dye molecule. 

Traditionally, 50-100 µg of total RNA (mRNA, rRNA, and tRNA)
has been needed to obtain good signals. With the development of better
reverse transcription enzymes and more efficient labeling procedures, the
need for total RNA is down to about 10-20 µg. Some biological samples
are so small that such RNA quantities cannot be extracted. Hence, 
PCR or in vitro transcription-based methods for amplifying the mRNA 
have been developed. With PCR, any nucleotide sequence can be hugely 
amplified, but a frequent concern with such methods is that the relative 
abundance of different mRNA species can be skewed in the amplification 
process and might not accurately reflect the levels in the original sample. 
In general, the in vitro transcription amplification methods are thought 
to amplify mRNAs with less bias. In in vitro transcription, the reverse 
transcription reaction is primed by an oligo-dT primer with a promoter 
sequence extension. After the reverse transcription, the double-stranded 
cDNA is generated and complementary RNA (cRNA) can be synthesized
by transcription from the promoter sequence in a test tube. Several cRNA
molecules can be transcribed from each double-stranded cDNA template. 
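
Why exponential amplification can skew relative abundance more than transcription-based amplification can be seen in a toy calculation. The model and every number in it are assumptions chosen for illustration: PCR is treated as multiplying copies by a fixed factor per cycle, and in vitro transcription as producing a fixed number of transcripts per template.

```python
def pcr_copies(start, efficiency, cycles):
    """Exponential model: each cycle multiplies copies by (1 + efficiency)."""
    return start * (1.0 + efficiency) ** cycles

def ivt_copies(start, transcripts_per_template):
    """Linear model: each template yields a fixed number of transcripts."""
    return start * transcripts_per_template

# Two hypothetical mRNA species, equally abundant in the original sample,
# that differ by about 5% in how efficiently they are amplified.
pcr_ratio = pcr_copies(1000, 0.95, 25) / pcr_copies(1000, 0.90, 25)
ivt_ratio = ivt_copies(1000, 950) / ivt_copies(1000, 900)
```

Under these assumptions the small per-cycle difference compounds over 25 PCR cycles into an apparent ratio of roughly 1.9, whereas the same relative difference in transcription yield leaves the ratio near 1.06.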

In-situ oligonucleotide arrays are manufactured with good precision, 
allowing for comparison between arrays that have been hybridized with 
a single labeled target sample (Figure 3.5). In vitro transcription
labeling is the method of choice for labeling targets for in-situ
oligonucleotide arrays (Figure 3.5). For GeneChip arrays from Affymetrix,
biotin-modified nucleotides are incorporated into the cRNA, which is 
then fragmented to approximately equal lengths to give more uniform 
hybridization properties. The biotin modifications are used as a handle 
for dye attachment. Methods have also been developed in which mRNA 
can be chemically labeled directly without the need of conversion to 
cDNA (PerkinElmer Life Sciences). Methods that attach more dye per 
transcript have also been developed, including systems in which the dye 
is enzymatically attached to pre-hybridized reverse-transcribed targets 
(PerkinElmer Life Sciences) and systems in which dye-oligonucleotide 
multimers (dendrimers) are hybridized to reverse-transcribed targets 
modified with a universal oligonucleotide extension (Genisphere). 

3.6. Hybridization 

Hybridization of the labeled target to the probes on a microarray is 
performed by adding the targets dissolved in hybridization buffer to the 
slide within a confined space, followed by incubation for a given amount 
of time at a certain temperature. The hybridization can, for instance, 
be performed under a microscope slide cover slip or within a chamber 
that limits the volume. Volumes are kept small to reduce the time of 
hybridization. Automated hybridization stations have been developed 
that agitate the hybridization solution over the slide and allow for better 
control of hybridization conditions, which gives lower backgrounds and 
better reproducibility. The hybridization conditions need to be set so as 
to promote the specific hybridizations between the target and individual 
probes and limit nonspecific hybridizations to the support itself or other 
probes. This is achieved mainly by varying the temperature and the 
ionic strength of the hybridization buffer. The temperature needs to be 
lower than the melting temperature of the probes but sufficiently high 
to reduce nonspecific hybridizations. The salt concentration, pH, and 
other characteristics of the buffer may also promote specific hybridiza- 
tions. It may be advantageous to add competing DNA like salmon sperm 
DNA, Cot-1 DNA (enriched with mammalian repetitive sequences), and 
poly-A DNA (to block nonspecific hybridization to poly-A regions).
After hybridization for anywhere from several hours to overnight, the
hybridization solution is discarded and the slides are subjected to washes
of varying ionic strength to remove nonspecifically bound targets with 
increasing stringency. After the wash regimen, the slide is dried and is 
ready to be scanned. The dyes used are degraded over time by expo- 
sure to light, so hybridized slides and labeled target solutions need to be 
stored in the dark. 
3.7. Scanning and Image Analysis

After hybridization, an image of the array with hybridized fluorescent 
dyes must be acquired. Microarray scanners have confocal lasers or other 
light sources to produce light at the wavelength that excites the fluores- 
cent dyes. The fluorescence emission intensity of the dyes is captured in 
high-resolution monochrome images acquired for each fluorescent dye. 
The scanner software then displays a composite colored image for multi- 
dye hybridizations. The goal is to measure, for each spot on the array, 
the relative amount of fluorescence from each dye hybridized with its 
target. Next, the probe spots have to be identified and the fluorescence 
intensities quantified from the high-resolution image. 

The processing of scanned spotted cDNA microarray images can be
separated into three tasks: gridding, segmentation, and intensity
extraction. Gridding or addressing is the process of assigning coordinates 
to each of the spots. A grid is placed at the approximate location of the 
spots, and an algorithm is used to fine-tune the location of the spots 
and classify pixels as part of a spot or the background. Also, the grid 
file typically contains information about the identity of the individual 
spots. 
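
Gridding can be sketched in Python as two small steps: placing nominal spot centers on a regular grid, and fine-tuning each center from the image. The regular pitch, the intensity-weighted centroid used for refinement, and all names here are assumptions for illustration; actual image analysis packages use their own procedures.

```python
def grid_centers(origin_x, origin_y, pitch, rows, cols):
    """Return {(row, col): (x, y)} nominal spot centers for a regular grid."""
    return {
        (r, c): (origin_x + c * pitch, origin_y + r * pitch)
        for r in range(rows)
        for c in range(cols)
    }

def refine_center(image, x, y, half_window):
    """Move a nominal center to the intensity-weighted centroid of a window."""
    total = sum_x = sum_y = 0.0
    for j in range(max(0, y - half_window), min(len(image), y + half_window + 1)):
        for i in range(max(0, x - half_window), min(len(image[0]), x + half_window + 1)):
            value = image[j][i]
            total += value
            sum_x += value * i
            sum_y += value * j
    if total == 0:
        return (float(x), float(y))   # no signal: keep the nominal position
    return (sum_x / total, sum_y / total)

centers = grid_centers(10, 20, 5, rows=2, cols=3)

# Toy 5 x 5 image with a single bright pixel one column right of the guess.
image = [[0] * 5 for _ in range(5)]
image[2][3] = 10
refined = refine_center(image, 2, 2, half_window=2)
```

The (row, col) keys play the role of the spot identities that the grid file associates with each location.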

The segmentation procedure allows the classification of pixels either 
as foreground (i.e., as corresponding to a spot of interest) or as back- 
ground. Segmentation algorithms designed to detect spot boundaries 
include edge detection algorithms, histogram methods, fixed-circle, and 
adaptive-circle segmentation. The location of the spots can be man- 
ually corrected to account for effects such as dust particles missed by 
the algorithm. Once the locations of the spots have been decided, the 
fluorescence intensities of all the pixels within each spot and outside it 
(i.e., the local background) are calculated and a result file is reported, 
in a step called intensity extraction. This includes calculating, for each 
spot on the array, foreground fluorescence intensities, background fluo- 
rescence intensities, locations, averages, medians, standard deviations, 
and possibly quality measures. These result files contain all the raw 
fluorescence data available for analysis of a microarray experiment.
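
Fixed-circle segmentation followed by intensity extraction can be sketched as follows. This is a minimal illustration in plain Python: the image is a list of pixel rows, the local background is taken as every pixel of a square patch that falls outside the circle, and the function name and the reported fields are hypothetical.

```python
from statistics import mean, median

def extract_spot(image, cx, cy, radius, patch):
    """Fixed-circle segmentation of one spot inside a square patch."""
    foreground, background = [], []
    for y in range(cy - patch, cy + patch + 1):
        for x in range(cx - patch, cx + patch + 1):
            if not (0 <= y < len(image) and 0 <= x < len(image[0])):
                continue  # skip pixels that fall outside the image
            inside = (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
            (foreground if inside else background).append(image[y][x])
    return {
        "fg_mean": mean(foreground),
        "fg_median": median(foreground),
        "bg_mean": mean(background),
        "bg_median": median(background),
        "n_fg": len(foreground),
    }

# Toy 7 x 7 image: a small bright spot (100) on a uniform background (10).
image = [[100 if (x - 3) ** 2 + (y - 3) ** 2 <= 1 else 10 for x in range(7)]
         for y in range(7)]
result = extract_spot(image, cx=3, cy=3, radius=1, patch=3)
```

Reporting medians alongside means is useful because a few very bright pixels, such as a dust particle, shift the mean far more than the median.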

Few microarray images are perfect. Image analysis is confounded by
many factors such as the compensation for the non-zero fluorescent back- 
ground observed on most arrays and blemishes on the slides. Common 
problems in spot shape and appearance such as comet tails or donut 
holes are often observed. Locally high background and weak signals are 
also problematic phenomena in array images. 
3.8. Microarray Data

Microarray data have two basic qualities, biological significance and 
statistical significance. The biological significance tells how much the 
expression of a gene is influenced by the condition under study, i.e., 
the expression ratio or the fold change. The biological significance is 
what researchers are interested in. The statistical significance tells how 
trustworthy the biological significance is. Because of the many sources of 
variability in microarray experiments, the statistical analysis is crucial 
for successful interpretation of the biological phenomena under study. 
To perform meaningful statistical analysis of microarray data, one must 
understand the format of the microarray data and how to interpret the 
data. The microarray data format for in-situ oligonucleotide arrays
differs considerably from that of spotted microarrays, and each data
format has its own analysis and interpretation requirements. 

3.8.1. Spotted Array Data 

Gene expression data obtained from either spotted cDNA arrays or 
from spotted oligonucleotide arrays are similar in format. There are 
several software packages, both commercial and freeware, available for 
image analysis and quantification of spotted microarrays. The formats 
of the result files vary greatly. Typically, spreadsheets are used to report 
raw array data, including the location of the spots, gene identity, mean 
and median of the pixel intensities within a spot, and local background 
intensities. 

The simultaneous hybridization of two specimen samples labeled with 
Cy3 (green) and Cy5 (red) dyes has special analysis requirements. The 
two dyes have different properties and light sensitivities. The Cy5 dye 
is much more readily broken down, with the result that even if equal 
amounts of target are hybridized, the fluorescence emission from the Cy5 
dye is lower. Hence, the fluorescence signals from the two dye channels 
have to be normalized in order to calculate correct expression ratios. 
Normalization not only corrects for different dye properties but also for 
concentration differences between the co-hybridized test and reference 
samples. Normalization methods or algorithms must be selected with 
reference to the kind and degree of systematic bias that is present. 

In practice, expression data obtained from either spotted cDNA ar- 
rays or spotted oligonucleotide arrays are often reported as an expression 
ratio. The expression ratio is simply the normalized ratio of the fluores- 
cence intensity of the test sample and the reference sample for a certain 
gene. Often intensities of both the test and reference samples are 
background corrected (subtracted) before calculating the normalized ratio. 
Genes that are upregulated two-fold have an expression ratio of two, and 
genes that are downregulated two-fold have an expression ratio of 0.5. 
The expression ratios are often expressed as log-2 ratios. A gene that is 
upregulated two-fold will have a log-2 expression ratio of 1 and a gene 
downregulated two-fold will have a log-2 ratio of -1. 
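
The arithmetic above can be sketched in a few lines of Python. This is a
minimal illustration, not the procedure of any particular image analysis
package. The function name and the norm_factor parameter are hypothetical,
and the normalization is assumed to be a single multiplicative factor
applied to the reference channel.

```python
import math

def log2_expression_ratio(test_fg, test_bg, ref_fg, ref_bg, norm_factor=1.0):
    """Return the log-2 expression ratio for one spot.

    norm_factor is a hypothetical multiplicative correction applied to the
    reference channel; how it is estimated depends on the normalization
    method chosen for the experiment.
    """
    test = test_fg - test_bg                 # background-corrected test signal
    ref = (ref_fg - ref_bg) * norm_factor    # corrected, normalized reference
    if test <= 0 or ref <= 0:
        raise ValueError("ratio undefined for non-positive corrected signal")
    return math.log2(test / ref)

# Illustrative intensities only: a two-fold increase and a two-fold decrease.
up = log2_expression_ratio(2100.0, 100.0, 1100.0, 100.0)
down = log2_expression_ratio(600.0, 100.0, 1100.0, 100.0)
```

The log-2 scale is symmetric: two-fold changes in either direction lie at
the same distance from zero.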

It may not be possible to calculate an expression ratio for all genes on 
an array. A gene may be unexpressed in the reference sample but highly 
induced in the test sample. Calculation of an expression ratio in such a 
case would need division by zero. These genes are usually of high inter- 
est, for instance in diseased versus normal gene expression. To be able 
to get at least an approximate expression ratio, even though it will be 
an underestimate of the true ratio, one often has to set a lowest allowed 
level of signal. This lowest allowed signal can be set to an arbitrary value 
or to the level of noise, for instance, some specified number of standard 
deviations of the background signal. It may be advantageous to discard 
low-intensity signals, or to attach confidence scores to them, to filter 
out unreliable intensity readings. Since the image processing has to be highly 
automated to increase the throughput of analysis, methods like outlier 
detection, which improve the data quality after quantification, may have 
considerable impact on the reliability of microarray data. 
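
One way to apply a lowest allowed signal is sketched below. The function
names are hypothetical, and the choice of two standard deviations of the
background is only an assumption standing in for "some specified number".

```python
import statistics

def signal_floor(background_pixels, n_sd=2.0):
    """Lowest allowed signal: n_sd standard deviations of the background.

    n_sd=2.0 is an illustrative default, not a recommended value.
    """
    return n_sd * statistics.stdev(background_pixels)

def floored_ratio(test_signal, ref_signal, floor):
    """Expression ratio with both signals raised to at least `floor`."""
    return max(test_signal, floor) / max(ref_signal, floor)

# Illustrative values: the gene is undetected in the reference sample.
background = [48.0, 52.0, 50.0, 46.0, 54.0]
floor = signal_floor(background)
ratio = floored_ratio(1500.0, 0.0, floor)
```

Because the reference signal is replaced by the floor, the resulting ratio
is finite and is an underestimate of the true ratio, as noted above.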



3.8.2. In-situ Oligonucleotide Array Data 

The GeneChip Absolute Analysis 19 calculates a variety of metrics us- 
ing hybridization intensities measured by the scanner. GeneChip probe 
arrays are scanned at high pixel resolution. For a higher-density probe 
array, the scan yields on average 8 x 8 pixels, or a total of 64 pixels, 
per probe cell (viewable in the .DAT file). A single intensity value for 
every probe cell, representative of the hybridization level of its target, 
is derived. The bordering pixels are excluded, the intensity distribution 
of the remaining pixels is calculated, and the 75th percentile of that 
distribution is used as the Average Intensity 
of the probe cell. The Average Intensities for all probe cells are saved in 
the .CEL file. Some metrics utilize intensity data from the entire probe 
array and are used for Background and Noise calculations. Other met- 
rics compare the intensities of the sequence-specific PM probe cells with 
their control MM probe cells for each probe set 20 . The use of the PM mi- 
nus MM differences averaged across a set of probes is aimed at reducing 
the contribution of background and cross-hybridization while increasing 
the quantitative accuracy and reproducibility of the measurements. 
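
The probe cell summary described above can be sketched as follows. This is
an illustration only: the function name is hypothetical, a single ring of
border pixels is assumed to be excluded, and the interpolation rule used
for the percentile is an assumption, since the text does not specify one.

```python
import statistics

def probe_cell_intensity(pixels):
    """Return the 75th percentile of the interior pixels of one probe cell.

    `pixels` is a square list of rows. The outermost ring of pixels is
    dropped (assumed border width of one pixel).
    """
    interior = [row[1:-1] for row in pixels[1:-1]]
    values = [v for row in interior for v in row]
    # quantiles(n=4) returns the three quartile cut points;
    # index 2 is the upper quartile (75th percentile).
    return statistics.quantiles(values, n=4, method="inclusive")[2]

# Illustrative 8 x 8 probe cell with synthetic pixel values.
cell = [[10 * r + c for c in range(8)] for r in range(8)]
intensity = probe_cell_intensity(cell)
```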

The GeneChip software provides both absolute analysis and comparison 
analysis algorithms. The software calculates an average intensity 
value for every probe cell. Then, the background is calculated and 
subtracted from the intensities of all probe cells. The noise is calculated 
by determining the degree of pixel-to-pixel variation within the same 
probe cells used to calculate the background. The numbers of positive 
and negative probe pairs are determined for every probe set. A Positive 
probe pair is one in which the intensity of the sequence-specific PM 
probe cell is significantly higher than the intensity of the control MM 
probe cell. A Negative probe pair is one in which the intensity of the MM 
probe cell is significantly higher than the intensity of the PM probe cell. 
Log average ratios and average differences are calculated directly using 
probe cell intensities. The log average ratio is derived from the ratio of 
PM probe cell intensity to that of the control MM. The average differ- 
ence for each probe set (the average of the differences between every PM 
probe cell and its control MM probe cell) is directly related to the level 
of expression of the transcript. By examining the positive fraction, the 
positive/negative ratio, and the log average ratio, a decision matrix may 
be employed to determine whether a transcript is Present (P), Marginal 
(M), or Absent (A; undetected). The Absolute call is displayed in the 
.CHP file in the GeneChip software. 
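
The probe set metrics named above can be illustrated with a short sketch.
It is not the GeneChip algorithm: the function name is hypothetical, the
min_gap threshold stands in for the unspecified test of "significantly
higher", and the log average ratio is assumed to be the mean of base-10
log ratios.

```python
import math

def probe_set_metrics(pm, mm, min_gap=0.0):
    """Summarize one probe set from paired PM and MM intensities.

    min_gap is a hypothetical threshold standing in for "significantly
    higher"; the actual criterion is not given in the text.
    """
    diffs = [p - m for p, m in zip(pm, mm)]
    positive = sum(1 for d in diffs if d > min_gap)
    negative = sum(1 for d in diffs if d < -min_gap)
    # Mean of log ratios PM/MM; the base and the averaging are assumptions.
    log_avg_ratio = sum(math.log10(p / m) for p, m in zip(pm, mm)) / len(pm)
    return {
        "positive": positive,
        "negative": negative,
        "positive_fraction": positive / len(diffs),
        "avg_diff": sum(diffs) / len(diffs),
        "log_avg_ratio": log_avg_ratio,
    }

# Illustrative intensities for a probe set with four probe pairs.
pm = [1200.0, 900.0, 400.0, 1500.0]
mm = [300.0, 450.0, 500.0, 500.0]
metrics = probe_set_metrics(pm, mm, min_gap=50.0)
```

A decision matrix for the Present, Marginal, or Absent call would then be
applied to these values; its cutoffs are not given here and are not
sketched.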

GeneChip Comparison Analysis performs additional calculations on 
data from two separate probe array experiments in order to compare 
gene expression levels between two samples. The analysis employs nor- 
malization by scaling techniques to minimize differences in overall signal 
intensities between the two arrays, allowing for more reliable detection of 
biologically relevant changes in the samples. The Comparison Analysis 
begins with the user designating an Absolute Analysis of one probe ar- 
ray experiment as the source of Baseline data and a second probe array 
experiment as the source of Experimental data to be compared to the 
Baseline. The results are used in a decision matrix to derive a Difference 
Call. Since the average difference of a transcript is directly related to 
its expression level, an estimate of the Fold Change of the transcript 
between the baseline and the experimental samples is also calculated. 

Although Affymetrix has developed prediction rules to guide the selec- 
tion of probe sequences with high specificity and sensitivity 21 , inevitably 
there remain some probes that hybridize to one or more nontarget genes. 
In the standard analysis 22 , the mean and standard deviation (SD) of the 
PM-MM differences of a probe set in one array are computed after ex- 
cluding the maximum and the minimum. If a difference is more than 3 
SD from the mean, a probe pair is marked as an outlier in this array 
and is discarded when calculating average differences of both the baseline 
and the experimental arrays. One drawback to this approach is that 
a probe with a large response might well be the most informative but 
may be consistently discarded. Furthermore, if one wants to compare 
many arrays at the same time, this method tends to exclude too many 
probes. Li and Wong 23 (2001) show that even after making use of the 
control information provided by the MM intensity, the information on 
expression provided by the different probes for the same gene is still 
highly variable. They note that the between-array variation in PM-MM 
differences can be substantial and that the variation due to probe effects 
is larger than the variation due to arrays. In addition, human inspection 
and manual masking of image artifacts is currently time-consuming. Li 
and Wong (2001) use a multiplicative model to detect and handle cross- 
hybridizing probes, image contamination, and outliers from other causes. 
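
The 3 SD rule of the standard analysis can be sketched as follows. The
function name is hypothetical, and the use of the sample standard
deviation is an assumption. The sketch covers only the standard rule, not
the multiplicative model of Li and Wong.

```python
import statistics

def flag_outlier_pairs(diffs, n_sd=3.0):
    """Flag PM-MM differences more than n_sd SDs from the trimmed mean.

    The mean and SD are computed after excluding the minimum and the
    maximum difference, as described for the standard analysis.
    """
    trimmed = sorted(diffs)[1:-1]          # drop the minimum and maximum
    mean = statistics.mean(trimmed)
    sd = statistics.stdev(trimmed)         # sample SD (assumption)
    return [abs(d - mean) > n_sd * sd for d in diffs]

# Illustrative PM-MM differences for one probe set on one array.
diffs = [510.0, 480.0, 495.0, 520.0, 505.0, 490.0, 5000.0, 470.0]
flags = flag_outlier_pairs(diffs)
```

In this example only the probe pair with the very large response is
flagged, which illustrates the drawback noted above: the most responsive
probe is the one that gets discarded.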



3.9. So I Have My Microarray Data - What’s 
Next? 

The remainder of this book will focus on statistical analysis of microar- 
ray data. Microarray technology has been used mostly as an exploratory 
or survey methodology, where the most interesting findings are verified 
by other more precise methods. The goal of the statistical analysis then 
becomes to limit the number of false positives and to validate clusters or 
groups of genes that were found based on their expression pattern. Mi- 
croarray experiments may never produce precise quantitative results for 
all measured transcripts. Instead, the power of microarray technology 
lies in its potential to survey the entire transcriptome in one experiment. 
With sound statistical analysis, patterns emerging from such a holistic 
view may give insights that traditional experiments never can. 



3.9.1. Confirming Microarray Results 

Microarrays are often used to find candidate genes for further studies. 
Because of the cost of microarrays, many investigators cannot afford 
thorough experimental designs with the numbers of replicates needed 
for a good statistical analysis. It then becomes imperative to verify the 
microarray result before spending a lot of time and money on follow-up 
experiments. Even with sound statistical analysis it may prove advanta- 
geous to first verify the most interesting genes coming from the analysis 
with an independent method. Two methods of choice are Northern blot 
analysis and quantitative real-time PCR. 

3.9.2. Northern Blot Analysis 

Specific RNA sequences can be detected by blotting and hybridization 
analysis using techniques very similar to those originally developed by 
Ed Southern for DNA. Different RNA molecules can be size-separated 
or fractionated by gel electrophoresis. The gel electrophoresis is run un- 
der denaturing conditions to keep the single-stranded RNA molecules 
denatured, thus allowing good separation. The RNA to be separated 
is loaded into a gel-like solid support consisting of an agarose polymer. 
Upon application of a current, the RNA molecules will migrate in the gel 
due to their phosphate backbone negative charge. A long molecule will 
migrate slowly through the agarose gel, while a short molecule will mi- 
grate quickly. After allowing the RNA molecules to separate for a while, 
the electrophoresis is stopped and the fractionated RNA is transferred 
from the agarose gel to a membrane support (Northern blotting) by 
capillary transfer. After transfer the RNA is immobilized on the mem- 
brane. The membrane can now be probed with radioactively labeled 
single-stranded cDNA probes complementary to the gene of interest. 
The amount of radioactivity hybridized to a test and reference sample 
separated on the same gel is proportional to the relative expression of 
the gene of interest in the test and reference samples. For more 
information on the Northern blot technology, see Ausubel et al. 24 (1993) and 
Sambrook et al. 25 (1989). 

Chu et al. 26 (1998) compared Northern blot assay of gene expression 
and corresponding microarray data of sporulation in budding yeast. As 
shown in Figure 3.7, the authors found that the pattern of expression 
as assayed by Northern analysis was very similar to that determined by 
microarray analysis. 

3.9.3. Reverse-transcription PCR and Quantitative 
Real-time RT-PCR 

The powerful amplification of nucleic acids achieved by the PCR 
methodology makes it an excellent method to use to detect low-abundance 
nucleic acids. Since RNA cannot serve as a template for PCR, the 
first step to quantify RNA is to reverse transcribe the RNA sample 
into cDNA. Reverse transcription PCR (RT-PCR), not to be confused 
with quantitative real-time RT-PCR (QRT-PCR), can be used for semi- 
quantitative assays of RNA abundance, simply by sampling the PCR 
mixture followed by gel quantification of the amplified product. Yet, 
as the name implies, this is at best only semi-quantitative. The ad- 
vent of fluorescence techniques applied to the reverse transcription PCR 
methodology, together with instrumentation capable of combining 
amplification, detection, and quantification, has revolutionized the 
possibilities of nucleic acid quantification. 

Figure 3.7. Northern blot assay of gene expression and corresponding microarray 
data. (A) Samples from the indicated time points were assayed by Northern analysis 
(19). Genes were chosen to be representative of the four previously identified temporal 
classes. DMC1, SPS1, DIT1, and SPS100 belong to the early, middle, mid-late, and 
late classes, respectively (1, 2). (B) RNA samples from the same time course as in 
(A) were analyzed by microarray analysis. Data are graphically displayed with color 
to represent the quantitative changes. Increases in mRNA (relative to pre-sporulation 
levels) are shown as shades of red, and decreases in mRNA levels are represented by 
shades of green. Source: Science, volume 282 (1998). Reprinted with permission. 



In quantitative real-time RT-PCR (QRT-PCR) the amplification can 
be monitored by fluorescence in “real-time” in every cycle of amplifica- 
tion. There is already a wide variety of specialized methods and reagents 
utilized in QRT-PCR. The most commonly used chemistries include 5' 
nuclease assays using TaqMan probes, molecular beacons, and SYBR 
green I intercalating dyes, all providing means to monitor the accumula- 
tion of PCR product in each cycle of the PCR reaction. The fluorescence 
values recorded during every cycle represent the amount of PCR prod- 
uct amplified to that point in the amplification reaction. The point at 
which a statistically significant signal can be recorded above background 
is called the threshold cycle. The more template present at the 
beginning, the fewer PCR cycles are required to reach the threshold 
cycle. The threshold cycle decreases linearly with the logarithm of 
the starting amount of template in the PCR reaction. By constructing 
standard curves of known amounts of starting template, unknown 
samples can be quantified very accurately. See Ginzinger 27 (2002) for 
a review of gene quantification using quantitative real-time PCR and 
Bustin 28 (2002) for a review of reverse transcription PCR. 
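
Quantification against a standard curve can be sketched as below. The
sketch assumes a straight-line relation between the threshold cycle and
the base-10 logarithm of the starting amount, fitted by ordinary least
squares. The function names and the dilution series are hypothetical.

```python
import math

def fit_standard_curve(amounts, cts):
    """Least-squares fit of Ct = slope * log10(amount) + intercept."""
    xs = [math.log10(a) for a in amounts]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def quantify(ct, slope, intercept):
    """Invert the standard curve to estimate the starting amount."""
    return 10 ** ((ct - intercept) / slope)

# Hypothetical dilution series: template copies and measured threshold cycles.
amounts = [1e2, 1e3, 1e4, 1e5]
cts = [30.0, 26.7, 23.4, 20.1]
slope, intercept = fit_standard_curve(amounts, cts)
unknown = quantify(25.0, slope, intercept)
```

The fitted slope is negative, reflecting that more starting template
reaches the threshold in fewer cycles.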

Chapter 4 



INHERENT VARIABILITY 
IN MICROARRAY DATA 



This chapter begins with a statistical characterization of the genetic 
populations that are investigated in microarray studies. It then goes 
on to examine the nature of microarray gene expression data that are 
generated from these populations and their sources of variability and 
error. 

4.1. Genetic Populations 

The target objects in a microarray study might be cell lines (such 
as yeast cells), biological specimens (such as tumor tissues), or biolog- 
ical systems under varying experimental conditions (such as organisms 
exposed to different levels of a toxin). For expository convenience, we re- 
fer to these target objects as either biological specimens or experimental 
conditions. These phrases may be replaced by equivalent terms, such as 
cell lines, tissues, and so on, when discussing particular applications of 
microarray technology. The index set of biological specimens or 
experimental conditions under investigation is denoted by 
$\mathcal{N} = \{1, 2, \ldots, N\}$, 
where N denotes the total number of sample specimens or conditions in 
the study. 

Each biological specimen under investigation defines a distinct 
biological population with a particular set of genes. The union of these 
gene sets for all N specimens forms the population gene set for the study, 
denoted by $\mathcal{G}_P$. Thus, the genes of any specimen in the study 
will be found in the set $\mathcal{G}_P$. The word ‘gene’ is used here to 
encompass a wide range of real genes, DNA strands, and other biologically 
coded objects that can bind to or hybridize with probes mounted on an 
array slide. 

The designated gene set in a microarray study refers to the set of 
probes that are spotted on the array. The designated gene set on the 
array is denoted here by the index set 
$\mathcal{G}_A = \{g : g = 1, 2, \ldots, G\}$, where G is the total number 
of probes in the designated set. In practical applications of microarray 
technology, the designated gene set for a microarray study often differs 
from the true population set $\mathcal{G}_P$, possibly to a great extent. 
The difference between the population gene set $\mathcal{G}_P$ and the 
designated gene set $\mathcal{G}_A$ on the array will usually be unknown, 
with set $\mathcal{G}_A$ containing genes that may not be in 
$\mathcal{G}_P$, and vice versa. In addition to genes in the population 
gene set $\mathcal{G}_P$, the designated set $\mathcal{G}_A$ may contain 
foreign genes, gene replicates, cDNA fragments, other gene-like objects, 
and even blanks. Foreign genes are genes that are alien to the biological 
specimens under study but are purposely included in the microarray 
for reasons of control, calibration, and monitoring of the array results. 
The presence of replicates implies that not all spots represent genetically 
distinct probes. Replication offers the benefits of improving statistical 
precision and diagnostic checking. 

To characterize the biological population, we consider the genetic 
makeup of the population in relative terms. The genetic material in each 
biological specimen has different concentrations of the distinct genes in 
the designated set $\mathcal{G}_A$ on the array. We shall denote the 
concentration of genetic material in specimen n that is attributable to 
gene g by $\zeta_{gn}$. By definition, $\zeta_{gn} \geq 0$ for each gene g 
and specimen n. If $\zeta_{gn} = 0$, then gene g is not present in 
biological specimen n. In microarray studies, the values of $\zeta_{gn}$ 
are quantities of interest and they are invariably unknown. Therefore, 
they must be estimated. More often, ratios of the form 
$\zeta_{gn'}/\zeta_{gn}$ for pairs of distinct experimental conditions n' 
and n are of interest. When these ratios are not unity, they indicate that 
gene g is up- or down-regulated under one experimental condition relative 
to the other. Observe that the concentrations $\zeta_{gn}$ convey no 
information about genes that are in the population gene set 
$\mathcal{G}_P$ but are missing from the designated set $\mathcal{G}_A$ on 
the array. 

A key design consideration in microarray studies is the composition 
and size of the designated gene set $\mathcal{G}_A$. The number of genes G in the 
gene set may be in the hundreds or in the thousands. The number 
of these genes that are expected to be differentially expressed between 
different biological specimens or experimental conditions may be very 
small, numbering perhaps in the dozens. One of the scientist’s chal- 
lenges is to cast the net for designated genes widely enough to include all 
genes that are implicated in a particular scientific question but narrowly 
enough that spurious or chance indications of differential expression are 
not numerous. In a later chapter we re-examine this issue in the context 
of sample size and power issues for microarray studies. 

Expression levels for genes are dictated physiologically by a combina- 
tion of genetic factors such as promoters, enhancers, and splice sites and 
a large number of environmental factors, including temperature, stress, 
and light, which can lead to changes in the levels of hormones and other 
signalling substances that affect gene expression levels 1 . Like mRNA 
levels, protein levels also change in response to physiological factors and 
changes in environment. Unlike DNA and RNA, however, protein activ- 
ity is not measured by hybridization but rather by binding of proteins 
to labelled oligonucleotides 2 . 

4.2. Variability in Gene Expression Levels 

A microarray study extracts a sample of genetic material from each 
specimen and through a laboratory and measurement routine obtains 
a reading of genetic intensity for each gene in the designated gene set 
$\mathcal{G}_A$. Many factors influence the relationship between these readings and 
the genetic population concentrations $\zeta_{gn}$, making it difficult to estimate 
the concentrations from the microarray data. Indeed, the purpose of this 
book is to address this exact problem. 

4.2.1. Variability Due to Specimen Sampling 

Sampling of genetic material for microarray studies generally takes 
place at several levels. Any discussion of variability must be clear about 
which sources of variability are under consideration. The first level of 
sampling is usually encountered when the biological specimens are se- 
lected in some broad context. For example, a uterine tumor of a partic- 
ular host patient may be selected as the specimen. If there are several 
such tumors in the host, however, then the one chosen for the study is a 
sample from among all tumors present in the host and, indeed, may be 
viewed as a sample of all future tumors that might develop in the host. 
In subsequent discussion, the selected specimen will usually be taken as 
the starting point of the statistical investigation. Inferences will relate 
to this biological specimen and not to more general populations that 
are farther back in the selection process. This definition of a biological 
specimen is a convenient operating definition. It is not meant to sug- 
gest that inferences farther back to more general populations are not of 
interest. In some applications, in fact, study design and analysis may 
need to relate to a very general level of definition. For example, it is 
conceivable that a uterine tumor study might take the relevant tumor 
population as all those that are extant in U.S. females of a given race 
in a specified age bracket at a given point in time. The connection be- 
tween such a general population and the microarray results for a small 
sample of uterine tumors taken from a few women in this population 
is very distant and tenuous, spanning sampling variability introduced 
at several intervening stages of sample selection. In order to generalize 
the findings to the general population of women, the microarray studies 
should be based on a random sample taken from the general patient pop- 
ulation. To reiterate, the following discussion and analysis will assume 
that the immediate biological specimens in the laboratory are the target 
biological populations of interest for statistical inferences. 

Even after the biological specimen is in hand, variability in expression 
measurement can arise from many factors. One of the first sources of 
variability encountered in many microarray studies is that produced by 
selecting the sample of genetic material from the population of interest. 
For example, a sample of tumor tissue must be selected from a patient’s 
tumor in a microarray study of uterine cancer. It is clear that the 
sample material may vary to the extent that the tumor is not a uniform 
biological object. The genetic composition of the sample may differ 
depending on whether the core biopsy is taken, for instance, from the 
peripheral zone or the transition zone of the tumor. The microarray 
study design must take sampling variability of this kind into account. 
For example, the design might call for taking a systematic selection of 
several samples from different parts of the tumor. 

4.2.2. Variability Due to Cell Cycle Regulation 

In a microarray study based on samples from yeast cultures synchro- 
nized by three independent methods, Spellman et al. 3 (1998) created 
a catalog of 800 yeast genes whose transcript levels vary periodically 
within the cell cycle. They note that, in addition to random fluctuations 
in the data, cross-hybridization between genes whose DNA sequences are 
similar can produce false positives when only one of the genes is actu- 
ally cycle-regulated. False positives can also occur when an unregulated 
gene overlaps the mRNA for a cell cycle-regulated gene; the cDNA cor- 
responding to the regulated gene would hybridize with the unregulated 
gene’s DNA. 

4.2.3. Experimental Variability 

Microarray experiments can encounter technical problems at any step. 
Few microarray images are perfect. Various systematic errors in 
microarray measurements may exist because of the preparation of probes, 
targets, and arrays as well as the procedures of image analysis. Variations 
can also be laboratory- or system-dependent. Schuchhardt et al. 4 (2000) 
present a systematic study of sources of distortion and noise in microar- 
ray studies involving spotting techniques. Investigations by Wang et al. 5 
(2001) and by Yang et al. 6 (2001) also show how numerous the potential 
sources are and how significant each might be in application. Listed be- 
low are some general sources of variation that are common to microarray 
experiments. 

Preparation of mRNA. 

As we mentioned earlier, the fluor-tagged cDNAs from a complex 
mRNA mixture extracted from cells are often termed targets. De- 
pending upon tissue and sensitivity to RNA degradation, targets may 
be different from sample to sample. 

Reverse transcription. 

Reverse transcription to cDNA may result in DNA species of varying 
lengths. 

PCR amplification. 

Clones are subject to PCR amplification. The amplification is difficult 
to quantify and may fail completely. 

Pin geometry and transport volume. 

Pin geometry can produce systematic variation. Pins have different 
characteristics and surface properties. The amount of transported 
target volume can fluctuate randomly even for the same pin of the 
arrayer. 

Slide heterogeneity. 

The fraction of target cDNA that is chemically linked to the slide 
surface from the droplet is unknown. Furthermore, the target may 
be distributed unequally over the slide or the hybridization reaction 
may perform differently in different parts of the slide. 

Fluorescence labeling. 

Depending upon nucleotide composition, radioactive (fluorescence) 
labeling may also fluctuate. 

Hybridization reaction. 

The efficiency of the hybridization reaction is influenced by a num- 
ber of experimental parameters, notably temperature, time, buffering 
conditions and the overall number of target molecules used for hy- 
bridization. Cross-hybridization within a gene family may also occur. 
Nonspecific hybridization is an error source that cannot be completely 
excluded. 

Non-specific background and over-shining. 

Non-specific radiation (fluorescence) and signals from neighboring 
spots can be present. 

Image processing and data acquisition. 

Results from image analysis can be affected by background noise and 
overshining from neighboring spots, non-linear transmission charac- 
teristics, and the procedures used in handling saturation effects and 
variations in spot shapes. 

In general, the exact gene expression levels are unknown, and par- 
ticular strategies have to be developed to quantify systematic errors in 
microarray experiments. A variety of correction methods can be found in 
the literature, including comparison of duplicated spots to quantify the 
variability for the same array and the same pin; analysis of control spots 
to quantify the variability from pin to pin and variations across the filter; 
checking the reproducibility on different filters; analysis of empty back- 
ground spots for non-specific noise and overshining; and use of dilution 
series of the target 7 . Brazma et al. 8 (2001) proposed the Minimum In- 
formation about a Microarray Experiment (MIAME) as a standard for 
recording and reporting microarray-based gene expression data. Any 
single microarray output is subject to substantial variability even under 
the relatively controlled conditions of an experiment. It is advisable to 
consider appropriate experimental designs and perform multiple stages 
of quality control before hybridizing valuable experimental samples. 

4.3. Test the Variability by Replication 

Replication serves to sharpen the precision of statistical inferences 
drawn from microarray studies. Replication also helps to evaluate whether 
data from arrays are uniform in some appropriate statistical sense, pro- 
viding a quality check on the scientific investigation. 

4.3.1. Duplicated Spots 

To check the consistency of microarray experiments and to study the 
effects of variability and replication on the reliability of cDNA microar- 
ray findings, Lee, Kuo, Whitmore, and Sklar 9 (2000) conducted a study 
to investigate whether the locations of cDNA spots on slides may pro- 
duce variation in measurements of transcriptions. A single human tissue 
sample formed the target biological sample. Only output from channel 1 
(green) contained expression readings for the target tissue. Output from 
channel 2 (red) contained noise alone. The study design consisted of 288 
genes, each printed at three locations on the same slide. By compar- 
ing the signals from these triplicates, Lee et al. evaluate the minimum 
variability that is likely to be inherent in a microarray system and learn 
more about the reproducibility of the array process and the outcome 
of analysis. The experiment was designed so that 32 of the 288 genes 
would be expected to be highly expressed because of Alu repeats that 
should cross-hybridize to similar sequences widely distributed among ex- 
pressed and unexpressed portions of the human genome. Results based 
on individual replicates, however, show that there are 55, 36, and 58 
highly expressed genes in replicates 1, 2, and 3, respectively. On the 
other hand, we will show in later chapters that by applying appropriate 
statistical methods one can pool the readings from the three replicates 
and obtain more accurate analytical results such that only 2 of the 288 
genes are incorrectly classified as expressed. As a result, a minimum of 
three replicates is recommended in a microarray study. This replication 
test data set is used to demonstrate a number of points later in this text. 



4.3.2. Multiple Arrays and Biological Replications 

There are different levels of biological replication. Depending upon 
the nature of the microarray study, the sampling units may be considered 
biological replicates, e.g., from inbred species. If inbred species are not 
readily available, species of the same strain can be used. In applications 
with human tissues, variation between individuals could be much larger 
than other sources of variation, and hence sampling from a large num- 
ber of individuals will be necessary. When multiple specimens from the 
same individuals are available, they can sometimes be considered repli- 
cates. For tumor tissues, however, even multiple specimens from the 
same tumor can have considerable variation. Moreover, if RNA sam- 
ples are divided into multiple plates for hybridization, multiple arrays 
can be made based on each RNA specimen. These replications can be 
very useful in testing the reproducibility of the findings based on array 
data. Baggerly et al. 10 (2001) propose parametric models for log ra- 
tios obtained from replications within a sample (within a channel) and 
replications between samples (between arrays and/or between channels). 
Wang et al. 11 (2003) proposed quantitative quality control measures for 
microarray experiments.



Chapter 5



BACKGROUND NOISE 



The reading or measurement for gene expression is usually a fluores- 
cence intensity measurement or some other quantitative marker. The 
reading may have undergone various adjustments within the instrument 
system, such as calibration. Thus, any description of gene expression 
data must be accompanied by an explanation of how the values are 
produced by the instrument system. The expression measurements will 
invariably contain a component that represents background noise. This 
chapter considers the nature of this noise and methods for taking it into 
account. 

5.1. Pixel-by-pixel Analysis of Individual Spots 

In cDNA arrays, gene expression corresponds to the fluorescence in- 
tensity of an image that is measured pixel by pixel. The discussion 
begins by considering an analysis at the pixel level. To demonstrate a 
pixel-by-pixel analysis, we cite a study on budding yeast Saccharomyces 
cerevisiae considered by Brown et al. (2001). In this study, the array 
compared two different strains (A and B) of wild-type S. cerevisiae. Fol- 
lowing methods described in Schena et al. 2 and DeRisi et al. 3 , mRNA 
from strain A was copied to cDNA and labeled with Cy3 (green), and 
mRNA from strain B was copied to cDNA and labeled with Cy5 (red). 
The cDNAs were hybridized to a spotted DNA microarray. The array 
carried about 6,200 genes. Images of the hybridized arrays were obtained 
from a modified fluorescence microscope that scans the slide at several 
wavelengths.




Figure 5.1. Methods for determining local background. 

(a) Gene expression graph of Cy5 vs. Cy3 intensity for 6,200 yeast genes with local
background used to offset signal intensities. Not shown are 98 spots with negative
intensities. (Inset) Local background was determined by averaging red and green
intensities over 16 pixels at each of the four corners (marked B) of a rectangular
region surrounding each spot. (b) Distribution of Cy5 intensities for a region of
the slide away from the hybridization area (dotted blue line), a spot with negative
intensity (a “black hole,” red line), local background surrounding the black hole (solid
blue line), and a nearby low-intensity spot (green line). Each distribution is derived
from over 150 sampled pixels. (c) Gene expression graph as in (a) but with best-fit
background derived by Brown et al. (2001). Only one (obviously scratched) spot has
negative intensity and is not shown. Source: Proceedings of the National Academy of
Sciences (PNAS), volume 98, Copyright (2001) National Academy of Sciences, U.S.A.
Reprinted with permission from PNAS.

Figure 5.2. Analysis of spot morphologies.

(a) A gallery of spots from cDNA-based microarrays, with their corresponding
red-green ratios graphed three-dimensionally. Clockwise from the top left: a
high-quality spot, a spot exhibiting dye separation, a scratched spot, and a clumped spot.
(b) Relationship between average signal intensity and the standard deviation of the
pixel-by-pixel intensities for all 6,200 spots. The red dotted line shows the trend
line; the green dotted line shows expected instrument noise based on photon counting
statistics. Source: Proceedings of the National Academy of Sciences (PNAS), volume
98, Copyright (2001) National Academy of Sciences, U.S.A. Reprinted with permission
from PNAS.

The fluorescence signal from a spot, having an approximate diameter
of 100 µm, is registered on roughly 200 pixels. Thus, each spot provides
a large number of red and green intensity measurements. Brown et al.
determine the amount of real and background fluorescence at each spot
on the array. The mean intensity for a spot, averaged over all pixels,
consists of (1) fluorescent signal from cDNA hybridized to the spot,
(2) background fluorescence arising from non-specific binding, and (3)
unwanted contributions from noise and photon counting error.

The authors determined the background arising from non-specific
binding by measuring local background intensity at the four corners of
a rectangular region of interest surrounding the spot (Figure 5.1). They
found that local background was an inexact measure of non-specific
fluorescence for microarray spots. When absolute intensities were
compared on a pixel-by-pixel basis for (a) a spot that looked like a “black
hole,” (b) the region surrounding the black hole (the local background),
(c) a weakly fluorescent spot, and (d) a point on the slide outside the
hybridization area, the observed intensity distributions showed
significant overlap. The center of the black hole was clearly less intense than
the local background, and the weakly fluorescent spot was only slightly
brighter. The problem of negative intensities arises not because the spot
is incorrectly located during image segmentation but rather because
more fluorescent probe is actually binding to the area surrounding the
spot than to the spot itself.
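The corner-based local background described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used by Brown et al.: the function names, the 4 x 4 corner blocks (chosen to give 16 pixels per corner), and the synthetic single-channel images are all hypothetical.

```python
import numpy as np

def local_background(region, patch=4):
    """Average the patch x patch pixel blocks at the four corners of a region."""
    corners = [region[:patch, :patch], region[:patch, -patch:],
               region[-patch:, :patch], region[-patch:, -patch:]]
    return float(np.mean([c.mean() for c in corners]))

def corrected_spot_intensity(region, spot_mask, patch=4):
    """Mean intensity over the spot pixels minus the local background estimate."""
    return float(region[spot_mask].mean()) - local_background(region, patch)

# Synthetic 20 x 20 region (hypothetical): flat background of 100 units
# with a circular spot in the middle.
yy, xx = np.mgrid[:20, :20]
spot_mask = (yy - 9.5) ** 2 + (xx - 9.5) ** 2 <= 36

bright = np.full((20, 20), 100.0)
bright[spot_mask] = 500.0        # ordinary spot, brighter than its surroundings

black_hole = np.full((20, 20), 100.0)
black_hole[spot_mask] = 60.0     # "black hole": spot dimmer than its surroundings

print(corrected_spot_intensity(bright, spot_mask))      # 400.0
print(corrected_spot_intensity(black_hole, spot_mask))  # -40.0
```

The second case shows how a negative intensity can arise even when the spot is located correctly: the surrounding area simply fluoresces more than the spot.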

To explore the intensity variations systematically, the authors plotted 
the standard deviation of the pixel-by-pixel intensities for each spot (av- 
eraged across both channels) against the spot’s average signal intensity. 
Figure 5.2 shows that the standard deviations in the pixel-by-pixel in- 
tensity distributions are large in absolute magnitude and rise linearly as 
the signal intensity increases. In contrast, measurement noise, including 
photon counting noise, rises only as the square root of the signal. 
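The contrast between the two scaling laws can be checked numerically. The sketch below is a hedged illustration: it assumes photon counting noise is Poisson distributed, and the coefficient of variation `cv=0.3` for the proportional component is a hypothetical value, not one taken from the study.

```python
import numpy as np

def photon_noise_sd(mean_counts, n_pixels=200_000, seed=0):
    """Empirical SD of Poisson (photon counting) noise at a given mean level."""
    rng = np.random.default_rng(seed)
    return float(rng.poisson(mean_counts, size=n_pixels).std())

def proportional_sd(mean_signal, cv=0.3):
    """SD of a signal whose pixel-to-pixel variation scales with its mean."""
    return cv * mean_signal

# Quadrupling the signal doubles the photon noise but quadruples
# the proportional component.
for mean in (100, 400, 1600):
    print(mean, round(photon_noise_sd(mean), 1), proportional_sd(mean))
```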

Because the microarrays analyzed in the authors’ study did not con- 
tain hybridization standards, they derive a computational method to 
determine background levels from the experimental spots themselves 
and conclude that pixel-by-pixel information for microarray images can 
be used to formulate measures that assess the accuracy with which an 
array has been sampled. 

5.2. General Models for Background Noise 

Instead of pixel-by-pixel analysis, we discuss in this section some
general models for handling microarray background noise.

The full data set consists of gene expression measurements $w_{gn}$ for all
genes $g = 1, \ldots, G$ in the gene set $\mathcal{G}_a$ and for all specimens
$n = 1, \ldots, N$ in the specimen set $\mathcal{M}$. The
$w_{gn}$ are usually arranged in a $G \times N$ matrix of values, with genes
corresponding to rows and specimen samples corresponding to columns.
Different microarray platforms yield different types of expression
measurements. The exact relationship of measurement $w_{gn}$ to the true
concentration $C_{gn}$ of gene $g$ in specimen $n$ depends on the technology, embedded
adjustments that have been used, and, importantly, on the background
noise that is present. We defer the discussion of missing and saturated
intensity values to later chapters.

5.2.1. Additive Background Noise 

A plausible statistical description for the gene expression reading $w_{gn}$
is that of a sum of two components, one representing latent (i.e.,
unobserved) true gene expression $x_{gn}$ and the other representing background
noise $B_{gn}$, as follows.

$$w_{gn} = x_{gn} + B_{gn} \tag{5.1}$$

The true unknown gene expression component $x_{gn}$ is assumed to increase
proportionally with $C_{gn}$, the true concentration of gene $g$ in specimen $n$.
Thus, $x_{gn} = 0$ if $C_{gn} = 0$; in other words, the $x_{gn}$ component is absent
from $w_{gn}$ if the specimen $n$ does not contain gene $g$. The background
noise component $B_{gn}$ is assumed to be mathematically independent of
$C_{gn}$. The assumption is also made that $x_{gn}$ and $B_{gn}$ are additive and
probabilistically independent. Both components are taken as nonnegative.
Observe that additive noise gives an incorrect indication of genetic
material being present even when none is truly there (a potential false
positive).
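A small simulation can make the additive model concrete. The sketch below is an illustration only: the lognormal distributions, their parameters, and the choice of which genes are present are hypothetical assumptions, not part of the model statement.

```python
import numpy as np

rng = np.random.default_rng(1)
G = 1000
present = np.arange(G) < 100     # hypothetical: only the first 100 genes are present

# Hypothetical lognormal components; both are nonnegative, as the model requires.
x = np.where(present, rng.lognormal(mean=8.0, sigma=0.5, size=G), 0.0)
B = rng.lognormal(mean=6.0, sigma=0.3, size=G)
w = x + B                        # additive model (5.1)

absent = ~present
print(np.all(w[absent] == B[absent]))   # True: absent genes yield pure noise readings
print(np.all(w[absent] > 0))            # True: every absent gene still gives a positive reading
print(np.log(w[absent]).mean())         # close to the simulated noise log-mean of 6.0
```

Because the readings for absent genes are pure noise, they form a sample from the noise distribution, which is why the last line recovers the simulated noise parameter.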

When gene $g$ is absent from specimen $n$, then the reading $w_{gn}$ is a pure
observation of background noise, i.e., $w_{gn} = B_{gn}$. If all background noise
values constitute a random sample from a common noise distribution,
then genes that are known to be absent from the specimen yield a sample
of observations from the noise distribution and can be used to infer its
form and parameter values. A later discussion of ANOVA models for
gene expression data exploits this observation. Figure 5.3 displays a
histogram of gene expression readings for a data set consisting of 864
cDNA spots that is discussed in detail later. The left-hand portion of
the histogram largely reflects the noise distribution.

5.2.2. Correction for Background Noise

The nature of noise and its correction depends on the microarray
technology being used. Data from spotted arrays provide an uncorrected
measure of expression intensity $w_{gn}$, as well as an estimate of the
background noise $B_{gn}$. If this background estimate is denoted by $\hat{B}_{gn}$,
then a background-corrected expression reading $w_{gn}^{(c)}$ is provided by

$$w_{gn}^{(c)} = w_{gn} - \hat{B}_{gn} = (x_{gn} + B_{gn}) - \hat{B}_{gn} \tag{5.2}$$

The ScanAlyze system, for example, outputs measures CH1I and CH2I,
which are uncorrected mean pixel intensities for the array spot for two
fluorescent hybridizations 4 and also produces background corrections
CH1B and CH2B for the same channels.

The correction procedure (5.2) may yield negative values for gene
expression, especially where component $x_{gn}$ is small (or zero) or the
background estimate $\hat{B}_{gn}$ is large. The following variant of the
background-corrected reading is obtained when negative values from (5.2) are set to
zero.

$$w_{gn}^{(c)} = \max\{w_{gn} - \hat{B}_{gn}, 0\} \tag{5.3}$$

If $w_{gn}^{(c)}$ is zero (or negative), the inference is generally made that $C_{gn} = 0$,
i.e., that gene $g$ is absent from specimen $n$.
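The two correction rules are simple enough to state as code. The following is a minimal sketch; the function name `background_correct` and the numerical readings are hypothetical and serve only to show the subtraction in (5.2) and the truncation at zero in (5.3).

```python
import numpy as np

def background_correct(w, b_hat, truncate=False):
    """Return w - b_hat as in (5.2); with truncate=True, apply the variant (5.3)."""
    corrected = np.asarray(w, dtype=float) - np.asarray(b_hat, dtype=float)
    return np.maximum(corrected, 0.0) if truncate else corrected

w = [2250.0, 1500.0, 9800.0]      # hypothetical uncorrected intensities
b_hat = [1722.0, 1600.0, 1800.0]  # hypothetical background estimates

print(background_correct(w, b_hat))                 # second value is negative
print(background_correct(w, b_hat, truncate=True))  # negative value set to zero
```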

The simple subtraction of a background correction provided by the
instrument software does not necessarily produce a better reading for
gene expression. It must never be forgotten that $\hat{B}_{gn}$ is only an estimate
of $B_{gn}$, and a bad estimate may be worse than having no estimate at all.

The difference between the true and estimated noise, i.e.,

$$B_{gn} - \hat{B}_{gn}, \tag{5.4}$$

appears in the background-corrected expression reading (5.2). It is
hoped that the true and estimated background noises are exchangeable
random variables, meaning that if $g(b, \hat{b})$ is their joint probability
density function (p.d.f.), then $g(\hat{b}, b) = g(b, \hat{b})$ for all outcome values $b$
and $\hat{b}$. This condition would imply that $B_{gn}$ and $\hat{B}_{gn}$ share the same
mean and variance and that their difference $B_{gn} - \hat{B}_{gn}$ has a mean of
zero, i.e., $E(B_{gn} - \hat{B}_{gn}) = 0$. One would also expect that the
difference has a smaller variance than the original background noise, i.e.,
$\sigma^2(B_{gn} - \hat{B}_{gn}) < \sigma^2(B_{gn})$. If the estimate $\hat{B}_{gn}$ imitates the first two
moments of the true background noise $B_{gn}$ in the sense of having the
same mean and variance, then it can be shown that $B_{gn}$ and $\hat{B}_{gn}$ must be
at least moderately correlated if the corrected intensity measure $w_{gn}^{(c)}$ is to be less
variable than the original intensity reading $w_{gn}$. Specifically, the true
background noise and its estimate must have a correlation coefficient
exceeding 0.5 if the background correction is to be helpful in this sense.
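The threshold of 0.5 can be derived in a few lines. The derivation below is a sketch under assumptions that go slightly beyond the text: the true noise $B_{gn}$ and its estimate $\hat{B}_{gn}$ share a common variance $\sigma^2$ and have correlation $\rho$, and the latent expression $x_{gn}$ is uncorrelated with both.

```latex
\begin{align*}
\operatorname{Var}\bigl(w_{gn}\bigr)
  &= \operatorname{Var}(x_{gn}) + \sigma^2, \\
\operatorname{Var}\bigl(w_{gn}^{(c)}\bigr)
  &= \operatorname{Var}(x_{gn}) + \operatorname{Var}\bigl(B_{gn} - \hat{B}_{gn}\bigr)
   = \operatorname{Var}(x_{gn}) + 2\sigma^2 (1 - \rho), \\
\operatorname{Var}\bigl(w_{gn}^{(c)}\bigr) < \operatorname{Var}\bigl(w_{gn}\bigr)
  &\iff 2\sigma^2 (1 - \rho) < \sigma^2
   \iff \rho > \tfrac{1}{2}.
\end{align*}
```

The middle line uses $\operatorname{Var}(B - \hat{B}) = \sigma^2 + \sigma^2 - 2\rho\sigma^2$.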

Diagnostic checks are available to verify whether background noise
correction has been effective. The checks are not conclusive but can be
helpful. The most effective checks involve examining genes suspected of
having no true gene expression. These will be genes for which $w_{gn}$ is
small. This judgment can be made using a histogram such as that in
Figure 5.3. One can select a cutoff point at which it appears that the bulk
of genes found to the left of the cutoff mainly reflect simple noise with
no true gene expression. For genes to the left of the cutoff, the observed
expression level $w_{gn}$ will consist largely of the pure background noise
$B_{gn}$. For these genes, the $B_{gn}$ and their estimates $\hat{B}_{gn}$ should be close
if the background estimates are reliable. This proximity can be checked
in a plot of $w_{gn}$ against $\hat{B}_{gn}$ for $w_{gn}$ smaller than the cutoff.
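The diagnostic check can be summarized numerically as well as graphically. The sketch below is one possible implementation, not the author's: the function name, the returned summary fields, and the synthetic data (lognormal noise with a roughly unbiased hypothetical estimate) are all assumptions made for illustration.

```python
import numpy as np

def background_diagnostic(w, b_hat, log10_cutoff):
    """Compare readings with background estimates for presumed-unexpressed spots."""
    w = np.asarray(w, dtype=float)
    b_hat = np.asarray(b_hat, dtype=float)
    low = np.log10(w) < log10_cutoff          # spots to the left of the cutoff
    corrected = w[low] - b_hat[low]
    return {
        "n_low": int(low.sum()),
        "correlation": float(np.corrcoef(w[low], b_hat[low])[0, 1]),
        "mean_w": float(w[low].mean()),
        "mean_b_hat": float(b_hat[low].mean()),
        "n_negative": int((corrected < 0).sum()),
    }

# Synthetic data (hypothetical): 700 unexpressed and 100 expressed spots.
rng = np.random.default_rng(2)
B = rng.lognormal(7.0, 0.3, size=800)
b_hat = B * rng.lognormal(0.0, 0.1, size=800)   # estimate scattered around the truth
x = np.concatenate([np.zeros(700), rng.lognormal(11.0, 0.5, size=100)])
w = x + B

report = background_diagnostic(w, b_hat, log10_cutoff=3.8)
print(report)
```

With a reliable estimate, the two means should be close and roughly half of the corrected readings below the cutoff should be negative; a large imbalance signals a biased background estimate.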

As background correction is never perfect, model (5.1) is also valid
for background-corrected data because each reading retains two
components. The component reflecting latent gene expression, namely $x_{gn}$, is
present. Also, the difference $B_{gn} - \hat{B}_{gn}$ remains as a residual background
noise component. Now, however, the residual background component
may assume both positive and negative values. Of course, if the
background correction has been effective, the residual background component
will tend to have a smaller magnitude than the original background noise
component.

In the remainder of this chapter and those that follow, the context
will make it clear whether the gene expression readings under discussion
are background-corrected. We will only use superscript (c) where there
is a need for explicit notation.

5.2.3. Example: Replication Test Data Set 

The replication test data set introduced in Lee et al. (2000) will be
used to illustrate some of the concepts and methods in this section. As
was mentioned in section 4.3, the data set was produced from a small
experiment that aimed to study the effects of variability and replication on
the reliability of cDNA microarray findings. The study design consisted
of 288 genes, each printed at three locations on the same slide. A single
human tissue sample formed the target biological sample. The
experiment was designed so that 32 of the 288 genes would be expected to be
highly expressed because of Alu repeats 5 that should cross-hybridize to
similar sequences widely distributed among expressed and unexpressed
portions of the human genome. Only output from channel 1 (green)
contained expression readings for the target tissue. Output from
channel 2 (red) contained noise alone. Counting the 288(3) = 864 spots as
the designated gene set, we have $G = 864$. There is only one biological
specimen, so $N = 1$. Note that the designated gene set $\mathcal{G}_a$ on the
array contains triplicates of 288 distinct genes. This data set is used to
demonstrate a number of points in the discussion that follows and also
later in the text.

The readings of gene expression $w_{g1}$ for the 864 cDNA spots are
displayed in a histogram in Figure 5.3. Here index $n = 1$ because only one
specimen is under consideration. The histogram shows the gene
expression data in terms of their common logarithms (i.e., logarithms to base
10). The gene expression data are the output denoted as CH1I in the
ScanAlyze system, which are the uncorrected mean pixel intensities of
spots for the green fluorescent hybridization (Eisen, 1999). Observe the
large concentration of small readings that generally correspond to noise.
A scattering of larger readings also appears; these are mainly associated
with gene probes that should be highly expressed. The logarithmic scale
amplifies the detail in the lower range of the data.

The noise component of the data appears unimodal and roughly sym- 
metrically distributed. The distribution of the smaller number of ex- 
pressed genes stands out as a separate distribution to the right end of 
the histogram scale. Thus, the distribution pattern of the data looks 
very much like a mixture of gene expression readings for a large num- 
ber of unexpressed genes (noise alone) and a smaller number of genes 
expressed to varying degrees. 

Taking the Replication Test Data Set as a case example, a log-value
of 3.8 is used as a cutoff for unexpressed genes (based on a rough visual
judgment for Figure 5.3). It is found that 761 of the 864 spots lie below
this cutoff level. As the experiment involved 256 gene triplicates that
should be unexpressed, the count 3(256) = 768 closely matches this
count of 761. Among the 761 spots, a plot of $w_{gn}$ against $\hat{B}_{gn}$ shows
a fairly clear linear relationship but one that does not follow the line
of identity. The scatter plot and line of identity appear in Figure 5.4.
The correlation coefficient for $w_{gn}$ and $\hat{B}_{gn}$ is 0.773. What is a little
more surprising is that only 9 of the background-corrected readings $w_{gn}^{(c)}$
for the 761 unexpressed genes are negative. If the background noise
estimates are unbiased estimates of their true values, then about half
of the unexpressed genes would be expected to have negative values for
$w_{gn}^{(c)}$.

Figure 5.3. Histogram of log-expression data for the Replication Test Data Set (using
common logarithms)

The scatter plot in Figure 5.4 shows clearly that the values of $w_{gn}$ for the 761 spots are
consistently larger than their counterpart background noise estimates
$\hat{B}_{gn}$ (i.e., they lie above the line of identity). This fact is also indicated
by a comparison of their mean values: 2250 is the mean for $w_{gn}$ and
1722 the mean for $\hat{B}_{gn}$. Thus, the results of the diagnostic check suggest that
background correction for this case example is moderately successful at
best.

The case example is convenient because there is a clear separation 
of expressed and unexpressed genes (Figure 5.3). In many applications, 
however, the separation is not as clear, and the set of unexpressed genes 
required for the diagnostic test will not be so easy to discern. 



Figure 5.4. Scatter plot of expression intensities (CH1I) and background estimates
(CH1B) for 761 genes classified as unexpressed. The plot also shows the line of identity.



5.2.4. Noise Models for GeneChip Arrays 

On the basis of the difference measure PM - MM from Affymetrix
GeneChip arrays, Li and Wong 6 (2001) introduce the following noise
model

$$PM_{ij} - MM_{ij} = \theta_i \phi_j + \epsilon_{ij}, \tag{5.5}$$

where $\theta_i$ is the fitted expression index of sample $i$, $i = 1, \ldots, n_s$,
$\epsilon_{ij} \sim N(0, \sigma^2)$ is the error term, and $\phi_j$ is the sensitivity of probe $j$,
$j = 1, \ldots, n_p$, with the constraint $\sum_j \phi_j^2 = n_p$ in order that the solution be
unique.
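A least-squares fit of a product model of this kind can be sketched with a singular value decomposition. This is not the fitting procedure of Li and Wong; it swaps in the rank-one SVD approximation as a stand-in, assumes the sum-of-squares form of the constraint ($\sum_j \phi_j^2 = n_p$), and uses hypothetical names and noiseless synthetic data.

```python
import numpy as np

def fit_rank_one(diff):
    """Least-squares fit of diff[i, j] ~ theta[i] * phi[j] with sum(phi**2) = n_p."""
    n_s, n_p = diff.shape
    u, s, vt = np.linalg.svd(diff, full_matrices=False)
    phi = vt[0] * np.sqrt(n_p)               # rescale so that sum(phi**2) == n_p
    theta = u[:, 0] * s[0] / np.sqrt(n_p)
    if phi.sum() < 0:                        # resolve the sign ambiguity of the SVD
        phi, theta = -phi, -theta
    return theta, phi

# Hypothetical noiseless example: 3 samples, 4 probes.
phi_true = np.array([0.6, 0.8, 1.0, 1.4])
phi_true *= np.sqrt(phi_true.size / np.sum(phi_true ** 2))
theta_true = np.array([120.0, 300.0, 50.0])
diff = np.outer(theta_true, phi_true)        # PM - MM without the error term

theta, phi = fit_rank_one(diff)
print(np.allclose(theta, theta_true), np.allclose(phi, phi_true))
```

Nothing in this fit prevents a fitted `theta[i]` from being negative when the differences PM - MM are noisy or negative.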

Sasik, Calvo, and Corbeil 7 (2002) point out that, under the Li and
Wong model in (5.5), the expression index $\theta_i$ can be negative. Also,
genes with negative $\theta$ can still be classified as present (as can genes with
negative average difference in the Affymetrix Microarray Suite 8 (5.0)).
Hence, they propose a model that does not suffer from the above
problems. It is based on an assumption similar to that underlying the
Li-Wong model, that fluorescent intensity of a PM probe (properly adjusted
for background $B$) is directly proportional to the concentration $C_i$ of the
transcript in sample $i$, that is,

$$PM_{ij} - B \approx \phi_j C_i \tag{5.6}$$

Taking the logarithm (to base 2) of both sides, the multiplicative model
in (5.6) can be written as an additive model on the logarithmic scale

$$\log_2 (PM_{ij} - B) \approx \log_2 \phi_j + \log_2 C_i \tag{5.7}$$

Defining $y_{ij} = \log_2 (PM_{ij} - B)$, $\psi_j = \log_2 \phi_j$, and $\gamma_i = \log_2 C_i$,
the model can be written as

$$y_{ij} = \psi_j + \gamma_i + \epsilon_{ij} \tag{5.8}$$

where $\epsilon_{ij} \sim N(0, \sigma^2)$. There is one such equation for each transcript,
and they allow for the possibility that the variance $\sigma^2$ of the error term
is different for each transcript. In model (5.8), regardless of the fitted
values of $\psi_j$ and $\gamma_i$, the probe sensitivities $\phi_j$ and concentrations $C_i$ can
never be negative, because $\phi_j = 2^{\psi_j} > 0$ and $C_i = 2^{\gamma_i} > 0$.
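The positivity argument can be seen in a small least-squares fit of the log-additive model. The sketch below is illustrative only and is not the estimation procedure of Sasik et al.: it assumes a complete samples-by-probes table, a known common background `B`, and a sum-to-zero constraint on the probe effects for identifiability, all of which are assumptions introduced here.

```python
import numpy as np

def fit_log_additive(pm, background):
    """Fit log2(PM - B) = psi[j] + gamma[i] by least squares on a complete table."""
    y = np.log2(np.asarray(pm, dtype=float) - background)
    grand = y.mean()
    psi = y.mean(axis=0) - grand   # probe effects, constrained to sum to zero
    gamma = y.mean(axis=1)         # sample effects absorb the grand mean
    return psi, gamma

# Hypothetical noiseless data: 2 samples (rows), 3 probes (columns).
B = 200.0
phi_true = np.array([0.5, 1.0, 2.0])   # log2 values sum to zero
C_true = np.array([100.0, 400.0])
pm = B + np.outer(C_true, phi_true)

psi, gamma = fit_log_additive(pm, B)
phi_hat, C_hat = 2.0 ** psi, 2.0 ** gamma
print(phi_hat, C_hat)                  # both strictly positive by construction
```

Whatever real values the fit returns for `psi` and `gamma`, exponentiating them yields positive sensitivities and concentrations.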

5.2.5. Elusive Nature of Background Noise 

The preceding sections have discussed the nature of background noise 
and statistical models for its correction. It was noted that background 
correction must be checked to ensure that it has been beneficial. Several 
investigations have shown that the subject is elusive for both spotted
arrays and in-situ oligonucleotide arrays.

Many researchers have chosen to use foreground intensities from spot- 
ted arrays without background correction because of concerns regarding 
the effectiveness of subtracting the machine-generated background noise 
measure. 

One of several problems with background measurement is image seg- 
mentation. From a statistical perspective, image segmentation is a data- 
dependent procedure. Segmentation in spotted microarrays may yield 
biased intensity readings because array spots tend to be centered on 
those regions of the image with the brightest fluorescence. This bias 
is illustrated by the case example that considered the analysis of background
correction for the Replication Test Data Set in section 5.2.3. Figure
5.4 revealed that the background intensity calculated by the ScanAlyze 
system in this illustration appeared to be too small, on average, leading 
to background-corrected intensities that were predominantly positive for 
genes that were known to be unexpressed. 

The difficulty of separating signal from noise has also been noted in
in-situ oligonucleotide microarrays. Naef et al. 9 (2003), for example,
note that, whereas conventional wisdom views the PM probes as
carrying the signal while MM probes serve as non-specific controls, their
experience shows that the MM probes actually track the signal. They
state that “the MMs should be viewed as a set of average lower affinity
probes”. They then go on to develop a method to exploit the signal
content of the MM probes. Along a similar line, Irizarry et al. 10 (2003)
remark that “Recent results . . . suggest that subtracting MM as a way
of correcting for non-specific binding is not always appropriate”. They
cite two sources for this remark and go on to say that “until a better
solution is proposed, simply ignoring these values is preferable.”



Chapter 6



TRANSFORMATION 
AND NORMALIZATION 



6.1. Data Transformations 

Gene expression readings $w_{gn}$ will not necessarily have desired
statistical properties without being transformed. Desirable properties may
include normality or constant variance, for example. Here we discuss
transformations of the data and related issues that have proven useful
in analyzing microarray data. We shall let $y_{gn} = h(w_{gn})$ denote the
transformed value where $h$ is a selected mathematical function. We will
limit our attention to monotonic increasing transformations, i.e., those
for which transformed measurements $y$ increase with observed
measurements $w$.



6.1.1. Logarithmic Transformation 

By far the most common transformation applied to microarray
readings is the logarithmic transformation

$$y_{gn} = \log w_{gn}. \tag{6.1}$$

The base of the logarithm may be 2, 10, or the natural logarithmic
constant $e$. The choice of base is largely a matter of convenient
interpretation.

There are several issues that arise with the use of a logarithmic
transformation. If the readings are initially corrected for background noise
as described in (5.2) or (5.3), then the background-corrected transformed
value, which we denote by

$$y_{gn}^{(c)} = \log w_{gn}^{(c)}, \tag{6.2}$$

will be undefined if $w_{gn}^{(c)}$ is zero or negative. Thus, the transformation is
reserved for readings where intensity exceeds background. This fact will
affect the interpretation of statistical models and findings.

A logarithmic transformation is used for microarray data because it
tends to provide values that are approximately normally distributed and
for which conventional linear regression and ANOVA models are
appropriate. If the transformed data are approximately normal, the
implication is that the raw readings are lognormally distributed. The
multiplicative version of the central limit theorem can be used as a partial
justification for the assumption of a lognormal form for the components
of additive background noise model (5.1). For example, both the latent
expression component $x_{gn}$ and noise component $B_{gn}$ of the microarray
reading $w_{gn}$ may be generated as the result of many independent
multiplicative random effects that produce the final values of these
components. As it happens, however, even if both $x_{gn}$ and $B_{gn}$ are independent
and lognormally distributed, their sum will not have this kind of
distribution. Thus, the log-transformed values $y_{gn}$ can only be approximately
normal in this situation.

The final test of whether log-intensity readings are appropriate for sta- 
tistical analysis of microarray data is whether reliable and useful results 
can be derived. Various diagnostic methods are available for checking for 
major departures from normality and other model failures. For example, 
normal probability plots can be used to provide a check on normality. 
Some of these diagnostic methods will be demonstrated in case studies 
taken up later. As already noted, the log-transformation is widely used 
and has been well defended by investigators in many studies. Thus, it 
is a key transformation for microarray gene expression data. Figure 5.3 
shows the distribution of such data after a logarithmic transformation. 

6.1.2. Square Root Transformation 

It is reasonable to expect that intensity readings in microarrays will
be proportional to the number of occurrences of fundamental molecular
events such as hybridizations. The constant of proportionality would be
the quantum of fluorescence, radiation, or other signal produced by a
single fundamental event. The fundamental events would fall into two
categories: a homogeneous category representing hybridizations that count
as true expression for a gene and a heterogeneous category of events (such
as non-specific binding) that represent noise. The latter category
contains events that are not of scientific interest but nonetheless are being
counted by the microarray instrument. A plausible model for the number
of each type of event is a Poisson distribution. As the gene expression
and noise components in model (5.1) are assumed to be independent and
the sum of independent Poisson variables remains Poisson distributed,
it follows that the gene expression reading $w_{gn}$ will be proportional to a
Poisson random variable. The Poisson model implies that the variance
of $w_{gn}$ is proportional to its mean value. Microarray data often have
variability that tends to increase with the level of the reading. If this
relation is one in which the variance increases in proportion to the mean,
then the preceding Poisson model is plausible.

For microarray data that follow this Poisson model, the square root
function

$$y_{gn} = \sqrt{w_{gn}} \tag{6.3}$$

is a variance-stabilizing transformation. Thus, the transformed values
$y_{gn}$ will tend to have a constant variance. For the Replication Test Data
Set, the histogram of square-root transformed data is not much different
in appearance from that shown for log-transformed data in Figure 5.3.
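The variance-stabilizing property of the square root is easy to check by simulation. The sketch below assumes readings that are exactly Poisson counts (a proportionality constant of one), which is a simplifying assumption; the function name and the mean levels are hypothetical.

```python
import numpy as np

def raw_and_sqrt_variance(mean, n=200_000, seed=3):
    """Variance of Poisson counts and of their square roots at a given mean."""
    rng = np.random.default_rng(seed)
    counts = rng.poisson(mean, size=n)
    return float(counts.var()), float(np.sqrt(counts).var())

# The raw variance tracks the mean, while the variance of the square roots
# stays near 1/4 regardless of the mean.
for mean in (25, 100, 400):
    print(mean, raw_and_sqrt_variance(mean))
```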



6.1.3. Box-Cox Transformation Family 

The two preceding transformations are two members of the Box-Cox
family of transformations. This family is defined as follows.

$$y_{gn} = \frac{w_{gn}^{d} - 1}{d} \tag{6.4}$$

The square root transformation corresponds to parameter $d$ being 1/2.
The logarithmic transformation corresponds to the limit of equation
(6.4) when $d$ approaches 0. The case where $d = 1$ corresponds to taking
no transformation at all, except for a change in origin.

The Box-Cox family provides a range of transformations that may
be examined to see which value of $d$ yields transformed values with the
desired statistical properties.
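The family and its special cases can be sketched as follows, assuming the standard Box-Cox form $(w^d - 1)/d$ with the logarithm as the limiting case at $d = 0$. The function name and the sample readings are hypothetical.

```python
import numpy as np

def box_cox(w, d):
    """Box-Cox transform (w**d - 1) / d, with the natural log as the d = 0 limit."""
    w = np.asarray(w, dtype=float)
    if d == 0:
        return np.log(w)
    return (w ** d - 1.0) / d

w = np.array([1.0, 10.0, 100.0, 1000.0])   # hypothetical positive readings

print(box_cox(w, 1.0))     # w - 1: only a change of origin
print(box_cox(w, 0.5))     # 2 * (sqrt(w) - 1): square root up to a linear rescaling
print(box_cox(w, 1e-7))    # nearly identical to the natural logarithm
print(box_cox(w, 0))       # natural logarithm
```

Scanning a grid of `d` values and inspecting, say, a normal probability plot of each result is one simple way to choose among the members of the family.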



6.1.4. Affine Transformation 

It is sometimes desirable or necessary to shift the origin of the intensity
measurement scale by some fixed amount, say $a$. Thus, if $w_{gn}$ is the
intensity reading, then $a + w_{gn}$ is the adjusted reading. We shall refer
to $a$ as a shift parameter or offset and to the transformation as an affine
transformation, meaning a shift away from the origin.

The affine transformation may be combined with other
transformations such as the logarithmic transformation, in which case we have

$$y_{gn} = \log (a + w_{gn}).$$

As the logarithmic transformation requires positive arguments, the affine
transformation may be used with the logarithmic transformation where
intensity readings have been background-corrected and some readings
are negative. In this case, the offset parameter $a$ would be chosen so
that $a + w_{gn}^{(c)}$ is positive for all readings.

Kerr et al. 1 (2002) have found that applying opposing affine
transformations to the two color intensities on a spot of a cDNA array may
improve the correspondence of intensity readings. Denoting the two
color readings for a spot $g$ on the $n$th array by $w_{gn}^{(R)}$ and $w_{gn}^{(G)}$, a plot of
$\log(w_{gn}^{(R)})$ against $\log(w_{gn}^{(G)})$ for all $g$ usually shows a curvilinear scatter
of points, indicating that the two color fluors measure gene expression
differently. The authors show that the following affine transformation
may eliminate the nonlinearity in the scatter plot and put the two colors
on an equal footing.

$$y_{gn} = \log \left( \frac{w_{gn}^{(R)} + a_n}{w_{gn}^{(G)} - a_n} \right) \tag{6.5}$$

In this case, the offset parameter $a_n$ is chosen separately for each array $n$
and then applied across all genes on the array. The red (R) and green (G)
color readings can be reversed in calculating this log-ratio of intensities.
Observe that the offset parameter tends to keep the average intensity
of the combined colors roughly unchanged but shifts one color reading
relative to the other. The shift is most pronounced for low intensities.
The authors propose that parameter $a_n$ be chosen to minimize the sum of
the absolute deviations of the $y_{gn}$ observations from their median value
for all genes $g$. This fitting criterion has the effect of giving a nearly
horizontal scatter in an MA plot for the two transformed intensities.
See section 6.2.4 for a discussion of such plots.
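The fitting criterion can be sketched as a grid search over candidate offsets. The code is an illustration under assumptions, not the authors' implementation: it takes the transformation to be the log of (red + a) / (green - a), uses a simple grid of hypothetical candidate values rather than a numerical optimizer, and is run on synthetic readings built so that the best offset is known.

```python
import numpy as np

def shift_log_ratio(w_red, w_green, a):
    """Log-ratio after opposing shifts: log((red + a) / (green - a))."""
    return np.log((w_red + a) / (w_green - a))

def choose_offset(w_red, w_green, candidates):
    """Pick the offset minimizing the sum of absolute deviations from the median."""
    best_a, best_cost = None, np.inf
    for a in candidates:
        if np.any(w_red + a <= 0) or np.any(w_green - a <= 0):
            continue                      # the logarithm would be undefined
        y = shift_log_ratio(w_red, w_green, a)
        cost = np.abs(y - np.median(y)).sum()
        if cost < best_cost:
            best_a, best_cost = a, cost
    return best_a

# Synthetic array (hypothetical): equal true expression in both channels,
# with the channels offset by 20 units in opposite directions.
true_level = np.geomspace(50.0, 50000.0, 200)
w_red = true_level - 20.0
w_green = true_level + 20.0

a_hat = choose_offset(w_red, w_green, np.arange(-25.0, 30.0, 5.0))
print(a_hat)   # 20.0
```

At the chosen offset the transformed log-ratios are flat across the intensity range, which is the nearly horizontal scatter the criterion aims for.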

The effect of the affine adjustment on the transformed expression
intensities may need to be explored in different applications. For example,
with a transformation $y = \log(a + w)$, the first derivative $dy/da$ equals
$1/(a + w)$ for natural logarithms. This fact tells us that a small change
in the shift parameter $a$ will have a small impact on large expression
values, because $1/(a + w)$ declines rapidly with increasing $w$, but a
large impact when $a + w$ is close to zero.



6.1.5. The Generalized-log Transformation 

Extending a model introduced by Rocke and Lorenzato 2 (1995) and 
Rocke and Durbin 3 (2001), Durbin et al . 4 (2002) propose a two-component 
error model for gene expression data from microarrays. Let w denote 
the measured raw expression level, /z the true expression level, and b the 
mean background noise (mean intensity of unexpressed genes). They 
demonstrate that the measured expression levels from microarray data 
can be modelled as 

w = b + ne v + e (6.6) 

where 77 ~ iV(0,<7 2 ) represents the proportional error that always exists, 
but is noticeable mainly for highly expressed genes, and e ~ N(0,a 2 ) 
represents the error for the background noise. Random variables 77 and 
e are taken as independent. 

Under this model, the variance of the measured intensity w at true 
level /z is given by 

Var (w) = /z 2 S n 2 + a 2 (6.7) 

where S v = \Je°^ (e CTr > 2 — 1) denotes the approximate relative standard 
deviation (RSD) of w for high levels of expression. 

At low expression levels (i.e., $\mu$ close to 0), the measured
expression can be written as

$$w \approx b + \varepsilon \tag{6.8}$$

implying that the measured expression in (6.8) is approximately
normally distributed with mean $b$ and constant variance
$\sigma_\varepsilon^2$. When $\mu$ is large, the measured expression may
be modelled by

$$w \approx \mu e^{\eta} \tag{6.9}$$

On the log scale, (6.9) can be written as

$$\ln(w) \approx \ln(\mu) + \eta \tag{6.10}$$

which implies that $\ln(w)$ has constant variance for $\mu$ sufficiently
large and that $\ln(w)$ is distributed approximately as a normal random
variable with mean $\ln(\mu)$ and variance $\sigma_\eta^2$. All terms in
(6.6) play a significant role only when the expression level $\mu$ falls
between these two extremes. The measured expression $w$ is then
distributed as a linear combination of a normal and a lognormal random
variable and has variance (6.7). Therefore, the distribution of the
measurement error changes depending on $\mu$, i.e., the variance changes
with the mean in a non-linear fashion.
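
The two regimes can also be read directly off the variance formula (6.7). As a short worked step, using only the symbols of (6.6) and (6.7) (the crossover level $\mu^{*}$ is a label introduced here for illustration):

$$\lim_{\mu \to 0} \mathrm{Var}(w) = \sigma_\varepsilon^2, \qquad \frac{\sqrt{\mathrm{Var}(w)}}{\mu} = \sqrt{S_\eta^2 + \frac{\sigma_\varepsilon^2}{\mu^2}} \;\longrightarrow\; S_\eta \quad (\mu \to \infty),$$

$$\mu^2 S_\eta^2 = \sigma_\varepsilon^2 \iff \mu = \mu^{*} = \sigma_\varepsilon / S_\eta .$$

Below $\mu^{*}$ the additive background error contributes most of the variance; above it the proportional error does.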

Durbin et al. (2002) use a mathematical procedure called the delta
method to derive the following transformation for microarray data that
stabilizes the asymptotic variance over the full range of the measured
data.

$$y = h(w) = \ln\left[\, w - b + \sqrt{(w - b)^2 + c} \,\right] \tag{6.11}$$



Here $c$ is the constant

$$c = \frac{\sigma_\varepsilon^2}{\exp(\sigma_\eta^2) - 1} \tag{6.12}$$



The transformation in (6.11) has the general form reported by the au- 
thors, but the expression for constant c in terms of the model parameters 
differs from that obtained in the cited paper. The two versions of the 
transformation differ little when $\sigma_\eta$ is small. This transformation
stabilizes the asymptotic variance of data distributed according to model
(6.6). For a large value of w , the transformation (6.11) is approximately 
the natural logarithm. At w near zero, the transformation (6.11) is ap- 
proximately linear. This transformation was considered by Hawkins 5 
(2002) in the context of another application. 
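
The limiting behaviour of (6.11) can be checked with a few lines of code. This is a minimal sketch: the values of b and c are hypothetical and are not estimated from any data set. For large w the transformation behaves like ln(2(w - b)), that is, the natural logarithm up to the background shift and the additive constant ln 2; near w = b it is approximately ln(sqrt(c)) + (w - b)/sqrt(c), which is linear in w.

```python
import math


def glog(w, b, c):
    """Generalized-log transformation (6.11): ln[(w - b) + sqrt((w - b)^2 + c)]."""
    z = w - b
    return math.log(z + math.sqrt(z * z + c))


b, c = 100.0, 2500.0   # hypothetical background level and constant c

# Large w: compare with ln(2 (w - b)).
w_large = 1.0e6
print(glog(w_large, b, c), math.log(2.0 * (w_large - b)))

# w near the background: compare with the linear approximation.
w_small = b + 1.0
print(glog(w_small, b, c), 0.5 * math.log(c) + (w_small - b) / math.sqrt(c))

# Unlike the plain logarithm, the transformation is defined for w below b.
print(glog(b - 50.0, b, c))
```

The last line illustrates a practical advantage: readings that fall below the background level still have a finite transformed value.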

Geller et al. 6 (2003) show that data from Affymetrix GeneChips con- 
form to the same two-component model in Durbin et al. (2002). Huber 
et al. 7 (2002) also consider a family of transformations that is related to 
the generalized-log family. 



6.2. Data Normalization 

The laboratory preparation of each biological specimen n on a mi- 
croarray slide introduces an arbitrary scale or dilution factor that is 
common to gene expression readings $w_{gn}$ for all genes $g$. Analysts
usually correct the readings for this scale factor and other variations
using a process called normalization. The purpose of normalization is to
minimize extraneous variation in the measured gene expression levels 
of hybridized mRNA samples so that biological differences (differential 
expression) can be more easily distinguished. 

Normalization is commonly based on the expression levels of all genes 
on an array. Assuming that most genes are similarly expressed in both 
the test and the reference sample and that the number of upregulated
genes largely matches the number of downregulated genes, normaliza- 
tion factors can be based on the total fluorescence, expression ratios, 
or regression analysis. Total fluorescence normalization assumes that 
approximately the same total amount of test and reference sample has 
hybridized, and thus the total fluorescence of both dyes used should be 
the same on the array. A normalization factor calculated from the ratio 
of the total fluorescence of the dyes can be used to re-scale the intensity 
of each gene on the array. 
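
Total fluorescence normalization can be sketched in a few lines. The intensities below are hypothetical, and the choice to rescale the red channel to match the green total (rather than the reverse) is an assumption made for illustration.

```python
def total_fluorescence_factor(red, green):
    """Normalization factor: ratio of total green to total red fluorescence."""
    return sum(green) / sum(red)


def rescale(intensities, factor):
    """Multiply every intensity on one channel by the normalization factor."""
    return [x * factor for x in intensities]


# Hypothetical red and green readings for four genes on one array.
red = [1200.0, 340.0, 5600.0, 80.0]
green = [1500.0, 400.0, 7100.0, 95.0]

k = total_fluorescence_factor(red, green)
red_scaled = rescale(red, k)
print(k, sum(red_scaled), sum(green))
```

After rescaling, the two channels have the same total fluorescence, which is the stated assumption of the method.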



Practical experience has shown that, in addition to array effects, other 
extraneous sources of variation may be present that cloud differential 
gene expression if not taken into account. These sources include many 
mentioned in Chapter 4, such as effects of dye color, pin tips, and spatial 
anomalies on slides. For example, an examination of two-color cDNA 
data sets shows that fluorescence of an array spot varies with the amount 
of hybridization in a different way for each dye color, i.e., depending on 
whether the fluor is red or green. These systematic color differences need 
to be taken into account. Pin-tip differences represent a potential source 
of variation. Spots are printed on the array by pins that are configured 
in a particular pattern for the experiment. As the pins are robotically 
controlled, the print pattern is reproduced across the arrays. To quote 
Yang et al. 8 (2002), pin-tip effects may result from “slight differences
in the length or in the opening of the tips, and deformation after many 
hours of printing.” Some systematic variation may be associated with 
regions of the slide surface. Pin-tip differences are one regular source of 
spatial variation, but others may also be present. A slide may need to 
be partitioned into regions and adjustments carried out for each region. 



It will be demonstrated that normalization should encompass all ma- 
terial sources of extraneous variation. In essence, normalization gives ap- 
proximately the same ‘average’ level of gene expression across all genes 
for each combination of experimental conditions. Sometimes analysts 
also wish to scale expression readings for each gene g so that the ‘aver- 
age’ level of gene expression is the same across all conditions for each 
gene. The normalization procedures differ with respect to which kind of 
average is used and what sources of variability are taken into account. 
Several authors have given systematic discussions of normalization. See, 
for example, Yang et al. (2002), among others.



6.2.1. Normalization Across G Genes

In this section we give a general framework for normalization based on
the Box-Cox family of transformations. For any transformed intensity
$y_{gn} = h(w_{gn})$ in the form of (6.4), the normalized value is given
by the centered differences

$$y_{gn} - y_{+n} = h(w_{gn}) - \frac{1}{G} \sum_{g=1}^{G} h(w_{gn}) \tag{6.13}$$

where $y_{+n}$ is the arithmetic mean value of the transformed intensity
$y_{gn}$ over all genes $g$. From equation (6.13) and the definition of a
mean, these transformed normalized readings necessarily sum to zero for
each experimental condition, as follows.

$$\sum_{g=1}^{G} (y_{gn} - y_{+n}) = 0 \tag{6.14}$$
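
The centering in (6.13) and the identity (6.14) can be illustrated as follows. This is a minimal sketch using the logarithmic transformation and a small hypothetical matrix of readings; the layout `w[g][n]` (genes in rows, conditions in columns) is an assumption of the example.

```python
import math


def normalize_across_genes(w, h=math.log):
    """Return the centered differences h(w[g][n]) - y_{+n} of (6.13)."""
    n_genes, n_conditions = len(w), len(w[0])
    y = [[h(w[g][n]) for n in range(n_conditions)] for g in range(n_genes)]
    # y_{+n}: mean of the transformed readings over genes, one per condition.
    col_means = [
        sum(y[g][n] for g in range(n_genes)) / n_genes
        for n in range(n_conditions)
    ]
    return [
        [y[g][n] - col_means[n] for n in range(n_conditions)]
        for g in range(n_genes)
    ]


# Hypothetical readings: 3 genes (rows) by 2 conditions (columns).
w = [[100.0, 250.0], [400.0, 90.0], [1600.0, 3000.0]]
z = normalize_across_genes(w)
for n in range(2):
    # Identity (6.14): each column sums to zero up to rounding error.
    print(n, sum(row[n] for row in z))
```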

To relate the transformed mean $y_{+n}$ to a measure of central
location on the original scale of intensity (i.e., on the $w$ scale), the
inverse transformation $h^{-1}$ is applied as follows, giving the measure
$w_{+n}$:

$$w_{+n} = h^{-1}(y_{+n}) = h^{-1}\left[ \frac{1}{G} \sum_{g=1}^{G} h(w_{gn}) \right] \tag{6.15}$$

For the special case of the identity transformation where $d = 1$ in
(6.4), the normalized values of the transformed readings are simply the
centered differences

$$y_{gn} - y_{+n} = w_{gn} - w_{+n},$$

where $y_{+n} = w_{+n}$ is the arithmetic mean of the $w_{gn}$ intensity
readings.

For the limiting case where $d$ approaches 0, the normalized values of
the transformed readings take the logarithmic form

$$y_{gn} - y_{+n} = \log w_{gn} - \log w_{+n} = \log (w_{gn}/w_{+n}),$$

where $w_{+n}$ is the geometric mean of the intensity readings $w_{gn}$
across genes. Often one considers the scale-normalized intensity
readings defined by

$$w_{gn}/w_{+n}.$$

Observe that, for any given array $n$, if all gene expression readings
$w_{gn}$ are multiplied by an arbitrary positive constant, the
scale-normalized readings would be unchanged for any of the Box-Cox
transformations. Thus, the ratios eliminate any scaling factor that cuts
across all genes within any experimental condition.
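
The scale invariance just described can be verified numerically. Equation (6.4) is not reproduced in this section, so the sketch below assumes the standard Box-Cox form h(w) = (w^d - 1)/d for d not equal to 0 and h(w) = log(w) for d = 0; the readings and the scale factor are hypothetical.

```python
import math


def box_cox(w, d):
    """Assumed Box-Cox form: (w^d - 1)/d for d != 0, log(w) for d == 0."""
    return math.log(w) if d == 0 else (w ** d - 1.0) / d


def box_cox_inverse(y, d):
    """Inverse of the assumed Box-Cox transformation."""
    return math.exp(y) if d == 0 else (d * y + 1.0) ** (1.0 / d)


def central_location(ws, d):
    """w_{+n} of (6.15): back-transformed mean of the transformed readings."""
    y_mean = sum(box_cox(w, d) for w in ws) / len(ws)
    return box_cox_inverse(y_mean, d)


def scale_normalized(ws, d):
    """Scale-normalized readings w_gn / w_{+n} for one array."""
    loc = central_location(ws, d)
    return [w / loc for w in ws]


ws = [100.0, 400.0, 1600.0]       # hypothetical readings on one array
scaled = [3.7 * w for w in ws]    # same readings times an arbitrary constant
for d in (0, 0.5, 1):
    print(d, central_location(ws, d), scale_normalized(ws, d), scale_normalized(scaled, d))
```

With d = 1 the central location is the arithmetic mean and with d = 0 it is the geometric mean, matching the two special cases in the text.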



6.2.2. Example: Mouse Juvenile Cystic Kidney Data Set 

A microarray data set introduced in Lee et al. 9 (2002b) will be used 
to illustrate normalization of data and also concepts and methods in 
following sections. This microarray data set was collected to investigate 
differential gene expression in kidney tissue from wild-type and mutant 
mice with juvenile cystic kidneys. Autosomal dominant polycystic kid- 
ney disease (ADPKD) is one of the most common monogenic diseases, 
characterized by the presence of multiple fluid-filled epithelial cysts in 
the kidneys. Animal models of polycystic kidney disease (PKD) are 
an ideal resource for investigating the perturbations of gene expression 
that occur in this disorder. Mouse mutants have the advantage of being 
maintained in inbred genetic strains, so variation of gene expression due 
to genetic background is minimized. Of particular significance is the 
opportunity to investigate differential expression that may be common 
to different models of PKD. These may represent molecular events that 
occur not as a result of a specific mutation but as a consequence of a 
more general injury to tubular integrity. As such, they may suggest 
pathways of molecular events that can be helpful for understanding the 
fundamental defect that occurs in PKD. Furthermore, they may suggest 
avenues of therapeutic intervention that can be investigated as a means 
to ameliorate disease progression. 

The experimental design in this study was chosen by the biological
scientists in advance of any consideration of statistical issues. It is
not ideal from a statistical viewpoint but is not atypical as far as
microarray studies are concerned. In this design, eight readings were
gathered for each gene in four microarray pairs, according to the
pattern set out in Table 6.1. Array in Table 6.1 refers to the four
microarray pairs (Arrays 1 to 4). Channel refers to whether the
expression reading comes from the Cy3 green fluor channel (Channel 1) or
the Cy5 red fluor channel (Channel 2). Type refers to kidney tissues
from two mouse types, mutant (Type 1) and wild-type (Type 2), as
described in the next section. The inclusion of a color channel effect
in the design was prompted by the scientists’ concern that gene
expression profiles might differ by channel, a concern that turned out
to be justified. The experimental design is not mutually orthogonal in
the three factors Array, Channel, and Type. In particular, each array
necessarily generates expression readings on both color channels. In
this particular design, Channel is orthogonal to Array and Type, but the
latter two factors are partially confounded.

            Channel 1        Channel 2
            Cy3 (Green)      Cy5 (Red)

Array 1     mutant           mutant
Array 2     mutant           wild-type
Array 3     wild-type        mutant
Array 4     wild-type        wild-type

Table 6.1. Experimental design for the Mouse Juvenile Cystic Kidney Data Set


The data set in this experiment contains ScanAlyze 10 cDNA gene ex- 
pression data for 1,728 genes. All 1,728 cDNA clones were generated in- 
house. Approximately 1,152 murine cDNA clones came from the murine 
brain cDNA library and 576 from the rat brain cDNA library. All clone 
sequences were verified by single-pass 3’ sequencing. Each cDNA insert 
was amplified by PCR and was printed in duplicate on treated glass 
microscope slides using a Genetics Microsystems 417 arrayer in linear 
pattern. Averages of the duplicated readings were used in the analysis. 

In the notation of this book, the design in this study involves
$G$ = 1,728 genes and $N$ = 8 experimental conditions. We now illustrate
normalization for the Mouse Juvenile Cystic Kidney Data Set. The
systematic effects of array and channel (dye color) will be removed in
this normalization. Table 6.2 gives the geometric mean expression levels
for the 1,728 genes for each array-channel combination in the
microarray study. The expression levels are not background-corrected.
The geometric mean corresponds to $w_{+n}$ for the logarithmic
transformation (Box-Cox parameter $d = 0$). To illustrate a normalized
reading, we note that the raw expression reading for gene 5137 for the
mutant tissue in array 2 on the green channel is 2820. Thus, its
scale-normalized reading for the log-transformation is
2820/3307 = 0.8527. The transformed normalized reading is
log(2820) - log(3307) = log(0.8527) = -0.1593, using natural logarithms.
Scale-normalized readings and transformed normalized readings for other
genes in the data set would be calculated accordingly. The identity in
(6.14) will hold for the transformed normalized readings of the 1,728
genes under each of the eight experimental conditions in this study.
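
The arithmetic for gene 5137 can be reproduced directly. The two input numbers are taken from the text and Table 6.2; the variable names are introduced here for illustration.

```python
import math

raw_reading = 2820.0       # gene 5137, mutant tissue, array 2, green channel
geometric_mean = 3307.0    # w_{+n} for array 2, green channel (Table 6.2)

scale_normalized = raw_reading / geometric_mean
transformed_normalized = math.log(raw_reading) - math.log(geometric_mean)

print(f"{scale_normalized:.4f}")        # prints 0.8527
print(f"{transformed_normalized:.4f}")  # prints -0.1593
```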

            Channel 1                      Channel 2
            Cy3 (Green)   Geometric Mean   Cy5 (Red)    Geometric Mean

Array 1     mutant        2813             mutant       2042
Array 2     mutant        3307             wild-type    2034
Array 3     wild-type     2040             mutant       2593
Array 4     wild-type     1815             wild-type    1797

Table 6.2. Geometric mean expression levels $w_{+n}$ for each combination $n$ of array,
channel and mouse type in the Mouse Juvenile Cystic Kidney Data Set



Looking at the geometric means across the experimental conditions, 
we note that there is considerable variation in the mean values, the 
largest being almost double the smallest. Mean gene expression levels 
for mutant tissue tend to be larger than for wild-type tissue. Also, there 
is a suggestion that green levels are generally higher than red levels. 
This last observation is consistent with the remark in section 3.8 about 
differences in the Cy3 and Cy5 dyes. 
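
These observations can be checked against the entries of Table 6.2. The sketch below is only a rough summary: it takes simple arithmetic averages of the tabulated geometric means by mouse type and by channel, which is a convenience chosen here and not a procedure from the text.

```python
# (array, channel, type) -> geometric mean expression level from Table 6.2
table_6_2 = {
    (1, "green", "mutant"): 2813,
    (1, "red", "mutant"): 2042,
    (2, "green", "mutant"): 3307,
    (2, "red", "wild-type"): 2034,
    (3, "green", "wild-type"): 2040,
    (3, "red", "mutant"): 2593,
    (4, "green", "wild-type"): 1815,
    (4, "red", "wild-type"): 1797,
}


def average_where(position, label):
    """Average the tabulated means whose key has `label` at `position`."""
    values = [v for key, v in table_6_2.items() if key[position] == label]
    return sum(values) / len(values)


# Ratio of the largest to the smallest tabulated mean.
spread = max(table_6_2.values()) / min(table_6_2.values())
print("largest / smallest:", round(spread, 2))
print("mutant vs wild-type:", average_where(2, "mutant"), average_where(2, "wild-type"))
print("green vs red:", average_where(1, "green"), average_where(1, "red"))
```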



6.2.3. Normalization Across G Genes and N Samples 

Where normalization is to be carried out across both genes and
experimental conditions, the normalized value for transformed intensity
reading $y_{gn} = h(w_{gn})$ is of the form

$$y_{gn} - y_{+n} - y_{g+} + y_{++} = h(w_{gn}) - \frac{1}{G} \sum_{g=1}^{G} h(w_{gn}) - \frac{1}{N} \sum_{n=1}^{N} h(w_{gn}) + \frac{1}{GN} \sum_{n=1}^{N} \sum_{g=1}^{G} h(w_{gn})$$

Here $y_{+n}$, $y_{g+}$ and $y_{++}$ denote arithmetic means of
transformed intensity readings taken over genes, conditions, and over
all genes and conditions, respectively. It is clear that the transformed
normalized readings in this case will sum to zero across both genes and
experimental conditions. As we will show later, these transformed
normalized readings give a rough picture of differential gene expression
because they represent differences in gene expression across
experimental conditions for each gene.
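
The two-way centering above can be sketched as follows, again with the logarithmic transformation and a small hypothetical matrix `w[g][n]` (genes in rows, conditions in columns).

```python
import math


def double_center(w, h=math.log):
    """Return y_gn - y_{+n} - y_{g+} + y_{++} for every gene g and condition n."""
    n_genes, n_conditions = len(w), len(w[0])
    y = [[h(v) for v in row] for row in w]
    gene_means = [sum(row) / n_conditions for row in y]           # y_{g+}
    cond_means = [
        sum(y[g][n] for g in range(n_genes)) / n_genes            # y_{+n}
        for n in range(n_conditions)
    ]
    grand_mean = sum(gene_means) / n_genes                        # y_{++}
    return [
        [y[g][n] - cond_means[n] - gene_means[g] + grand_mean
         for n in range(n_conditions)]
        for g in range(n_genes)
    ]


# Hypothetical readings: 3 genes by 3 conditions.
w = [[120.0, 300.0, 95.0], [4000.0, 5200.0, 3100.0], [60.0, 45.0, 80.0]]
z = double_center(w)
print([sum(row) for row in z])                             # sums over conditions
print([sum(z[g][n] for g in range(3)) for n in range(3)])  # sums over genes
```

Both sets of sums are zero up to rounding error, as stated in the text.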



6.2.4. Color Effects and MA Plots

Before continuing with the discussion of normalization, it is necessary 
to digress briefly to consider color effects and the relation of differential 
expression to average intensity levels. This discussion lays the ground- 
work for understanding their role in normalization. 

To start the discussion of color effects, we again denote the red and
green intensity readings by $w^{(R)}$ and $w^{(G)}$. We suppress the
subscripts for gene and experimental condition for the moment. To show
that fluorescence of an array spot varies with the amount of
hybridization in a different way for each color, we might plot
$\log w^{(R)}$ against $\log w^{(G)}$. Many other authors have used
similar graphs to demonstrate this point.

In contrast, Dudoit et al. 11 (2000) and Yang et al. (2002) propose a
plot of the log intensity ratio

$$M = \log\left(w^{(R)} / w^{(G)}\right) = \log w^{(R)} - \log w^{(G)} \tag{6.16}$$

against the average log-intensity

$$A = \log \sqrt{w^{(R)} w^{(G)}} = \frac{1}{2} \left( \log w^{(R)} + \log w^{(G)} \right). \tag{6.17}$$

A plot of $M$ versus $A$, referred to as an MA plot, amounts to a 45°
rotation of the $(\log w^{(G)}, \log w^{(R)})$ coordinate system,
followed by scaling of the coordinates. Difference $M$ captures the
color difference on a log-scale and $A$ represents the average
log-intensity for the two colors.
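
The relation between the MA coordinates and a 45° rotation can be made explicit in code. This is a minimal sketch with hypothetical red and green readings; natural logarithms are assumed here, although any base gives the same geometry.

```python
import math


def ma_values(red, green, log=math.log):
    """Compute M (6.16) and A (6.17) for paired red and green readings."""
    m = [log(r) - log(g) for r, g in zip(red, green)]
    a = [0.5 * (log(r) + log(g)) for r, g in zip(red, green)]
    return m, a


def rotate_45(x, y):
    """Coordinates of (x, y) in axes rotated counterclockwise by 45 degrees."""
    s = math.sqrt(2.0)
    return (x + y) / s, (y - x) / s


red = [1200.0, 340.0, 5600.0, 80.0]      # hypothetical red readings
green = [1500.0, 400.0, 7100.0, 95.0]    # hypothetical green readings
m, a = ma_values(red, green)

for r, g, m_i, a_i in zip(red, green, m, a):
    x_rot, y_rot = rotate_45(math.log(g), math.log(r))
    # After rotation, A is x_rot / sqrt(2) and M is sqrt(2) * y_rot.
    print(round(a_i, 6), round(x_rot / math.sqrt(2.0), 6),
          round(m_i, 6), round(math.sqrt(2.0) * y_rot, 6))
```

Each printed row shows A and M twice, once from the definitions and once from the rotated and rescaled log coordinates.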

To demonstrate an MA plot and the color asymmetry, we consider 
data from the Mouse Juvenile Cystic Kidney Data Set. Figure 6.1 shows 
the difference M in normalized log-intensity for the red and green chan- 
nels plotted against the average normalized log intensity A for all genes. 
The normalization is that in (6.14), with $h(w) = \log(w)$, which involves
centering of the log-readings on each color channel. The raw data
correspond to CH1I and CH2I for Array 1 (mutant tissue) in Table 6.1 and,
hence, are not background-corrected.

The graph shows the zero lines for M and A (the log-averages). The 
graph also shows the smooth values of M as a LOWESS 12 -fitted function 
of A. The scatter plot of the points is concave upward. Since the gene 
expression readings for both colors are coming from the same mutant 
tissue, the curvature in the smooth fitted function illustrates the differ- 
ence in intensity response between the two colors as a function of the 
level of expression (as measured by average A). Note that the red color 
tends to be relatively more intense at lower and higher intensity levels 




